Abstract
Population stratification (PS) is a primary consideration in studies of the genetic determinants of human traits. Failure to control for it may lead to confounding, causing a study to fail for lack of significant results or resources to be wasted following false positive signals. Here we review historical and current approaches for addressing PS when performing genetic association studies in human populations. We describe methods for detecting the presence of PS including global and local ancestry methods. We also describe approaches for accounting for PS when calculating association statistics, such that measures of association are not confounded. Many traits are being examined for the first time in minority populations, populations that may inherently feature PS.
Keywords: POPULATION STRATIFICATION, ASSOCIATION CONFOUNDING, GLOBAL ANCESTRY, LOCAL ANCESTRY, ADMIXTURE, ADMIXTURE MAPPING
KEY CONCEPTS
Definition and Causes of Population Stratification
As the geographic range of Homo sapiens expanded over time and groups left the site of their geographic origins in Africa [Vigilant, et al. 1991], they separated into subgroups and experienced novel stresses and environments. Geographic isolation, interbreeding, and adaptation differentiated human populations from each other [Schlebusch, et al. 2012].
Fossil and genetic evidence suggests that anatomically modern humans evolved in Africa about 150,000 to 190,000 years ago [McDougall, et al. 2005; White, et al. 2003] and expanded into a diverse array of niches there, providing Africans with the highest level of genetic diversity among current human continental populations [Rosenberg, et al. 2002; Tishkoff, et al. 2009]. Humans subsequently migrated into Europe, Asia, and the Americas in an approximately West-to-East pattern that began roughly 50,000 – 100,000 years ago [Gravel, et al. 2011; Harris and Nielsen 2013; Li and Durbin 2011; Mallick, et al. 2016] and concluded with the settlement of South America sometime in the last 15,000 years [Jenkins, et al. 2012]. The features of this scenario are increasingly complex, as understanding of hominin origins is updated regularly by increasingly sophisticated studies of modern populations, discoveries of ancient DNA specimens, and archaeological artifacts. A recent review by Nielsen et al. covers this history and the evidence that supports it [Nielsen, et al. 2017].
Among the effects of this period of colonization and the migrations during and afterward, as well as mating between populations of humans and other hominins [Green, et al. 2010; Meyer, et al. 2012; Vernot and Akey 2015], are differences across populations in allele frequencies throughout the genome. These differences, however they arise, are detectable in studies of human populations and provide information about both demographic history and geographic origins in modern humans [Novembre, et al. 2008; Wang, et al. 2012a]. This state, where populations are distinguishable by observing genotypes, is referred to as population structure or population stratification (PS).
PS may confound associations between genotype and the trait of interest in a genetic study. When PS exists, false positive or negative associations between genotype and trait may arise from differences in local ancestry that are unrelated to disease risk or trait variance. A consequence of PS, genetic admixture, arises from interbreeding of ancestral groups. A common example of genetic admixture is the African American population, which has both African and European ancestry. These factors must be considered in study designs and accounted for statistically in order for results of genetic association studies to be reliable. In this Unit, we will discuss the causes of PS and its history in genomic investigations, methods for observing global and local ancestry within a population, and techniques to account for and leverage differences in ancestry within genetic association studies.
PS is caused by non-random mating and most often arises due to geographic isolation of subpopulations with low rates of migration and gene flow over the course of several generations [Hartl and Clark 2007]. The geographic separation of these isolates allows for divergent random genetic drift due to sampling error in the set of parental alleles, which is subsequently propagated through successive generations. As a result, allele frequencies change randomly over time as an independent process for each population isolate, ultimately causing observable differences in the frequency of many alleles after several generations of separation and differentiation.
This scenario also introduces the possibility of selection for different traits in different geographic regions. A classic example of selection is hypolactasia, or lactose intolerance, a trait in which decreased production of the lactase enzyme prevents individuals from metabolizing the milk sugar lactose in adulthood [Bayless, et al. 2017]. One of the first genetic variants found to be associated with hypolactasia in humans, rs4988235, resides not in the lactase gene LCT, but rather in an enhancer region within an intron of another gene, MCM6, approximately 15 kb upstream of the LCT promoter [Enattah, et al. 2002; Lewinsky, et al. 2005; Olds and Sibley 2003]. A two-variant haplotype including rs4988235 and rs182549 explains 77% of hypolactasia variance in Europeans, but does not explain the trait distribution in individuals of African ancestry [Mulcare, et al. 2004]. Multiple studies have discovered additional variants, including rs145946881, rs41380347, and rs41525747, that explain the distribution of hypolactasia in Africans, all of which reside in the same enhancer region as the European variant rs4988235 [Friedrich, et al. 2012; Ingram, et al. 2007; Ingram, et al. 2009; Tishkoff, et al. 2007]. Age estimates for the European hypolactasia variant rs4988235 range from 2,188 to 20,650 years ago [Bersaglieri, et al. 2004]. Similarly, age estimates for the African variant rs145946881 range from 1,200 to 23,200 years ago [Tishkoff, et al. 2007]. An empirical example of PS is the spurious association between LCT and height in a case-control study of a European American population [Campbell, et al. 2005]. A single nucleotide polymorphism (SNP) in LCT showed strong association (p-value < 10⁻⁶) with height when PS was not addressed. No significant association was detected between the SNP and height after correcting for PS.
Larger, more ancient gene pools, such as African ancestry, have a greater amount of overall variation and a finer linkage disequilibrium (LD) structure between markers [Goddard, et al. 2000]. Maximum ability to differentiate populations comes from genetic markers with large frequency differences among the parental populations for admixed samples. These markers, often SNPs, are known as ancestry informative markers (AIMs). AIMs are frequently incorporated into genotyping experiments when PS is suspected for downstream conditioning on inferred ancestral information in association modeling [Pritchard and Donnelly 2001].
The differentiation among subpopulations is detectable even when the regional differences are subtle, as has been described in Chinese, Japanese, and European populations [Gao and Starmer 2007]. Cultural differences among populations also create stratification, even when populations inhabit the same geographical region. An example of this is the detectable difference between populations that speak Khoesan languages, which include click consonants, and non-Khoesan speaking peoples who occupy the same geographic range [Tishkoff, et al. 2009]. Recent evidence shows that, even after correction for ancestry inferred from common genetic factors such as AIMs, subtle uncorrected population substructure persists in some genomic studies [Bhatia 2016].
Measures of genetic differentiation
There are several measures of genetic differentiation to evaluate the relationship of subpopulations to one another. One of the classical approaches is the fixation index (Fst), which compares the differences in expected heterozygosity across populations under Hardy-Weinberg Equilibrium [Weir and Cockerham 1984; Weir and Hill 2002; Wright 1921]. The drift toward fixation in isolated groups results in a loss of heterozygosity in the total population, which is known as the Wahlund effect [Wahlund 1928]. Specifically, Fst quantifies the proportional impact the subpopulations have on the heterozygosity estimate relative to the situation where there is no population structure. An expression for Fst relating the expected heterozygosity under Hardy-Weinberg Equilibrium H of a single marker in the subpopulation s, denoted Hs, to the total Ht is Fst = (Ht − Hs)/Ht. Average Fst across a set of unlinked markers is a standard metric for assessing population genetic differentiation. Smaller Fst indicates similar allele frequencies between populations, while larger values mean that the allele frequencies are different [Holsinger and Weir 2009]. Sewall Wright suggested the following guidelines for interpreting values of Fst: 0–0.05 indicates little differentiation, 0.05–0.15 indicates moderate differentiation, 0.15–0.25 indicates great differentiation, and greater than 0.25 indicates very great differentiation.
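The Fst expression above can be illustrated with a minimal Python sketch for a single biallelic marker. The function names are hypothetical, and the sketch assumes that Hs is taken as the (optionally size-weighted) mean of the within-subpopulation expected heterozygosities and that Ht is computed from the pooled allele frequency; published estimators such as that of Weir and Cockerham include sample-size corrections that are not shown here.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1-p) of a biallelic marker under HWE."""
    return 2.0 * p * (1.0 - p)


def fst(sub_freqs, sub_sizes=None):
    """Fst = (Ht - Hs) / Ht for one biallelic marker.

    sub_freqs: allele frequency in each subpopulation.
    sub_sizes: optional relative subpopulation sizes (assumption: used as
               weights; equal weights are used when omitted).
    """
    if sub_sizes is None:
        sub_sizes = [1.0] * len(sub_freqs)
    total = float(sum(sub_sizes))
    weights = [n / total for n in sub_sizes]

    # Hs: weighted mean of the within-subpopulation expected heterozygosity
    hs = sum(w * expected_heterozygosity(p)
             for w, p in zip(weights, sub_freqs))

    # Ht: expected heterozygosity computed from the pooled allele frequency
    p_bar = sum(w * p for w, p in zip(weights, sub_freqs))
    ht = expected_heterozygosity(p_bar)

    if ht == 0.0:
        return 0.0  # marker is monomorphic in the total population
    return (ht - hs) / ht


# Two equally sized subpopulations with allele frequencies 0.3 and 0.8
print(round(fst([0.3, 0.8]), 3))
```

With allele frequencies of 0.3 and 0.8, the sketch gives Ht = 0.495 and Hs = 0.37, so Fst is approximately 0.253, which falls at the boundary of Wright's "very great differentiation" category.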
Because the effects of alleles on traits detected in genetic studies are usually subtle, relatively small levels of differentiation can confound tests of association. Factors that can accelerate the rate of differentiation at a locus are small subpopulation size, inbreeding, selection, and mutation. Factors that slow the rate of differentiation include migration and gene flow between subpopulations and large population size. Approaches for using Fst for estimating migration rates, inferring demographic history, identifying genomic regions under selection, forensic science and association mapping, and a discussion of the relationship with coalescent theory were reviewed by Holsinger and Weir [Holsinger and Weir 2009]. Observed Fst values across human subpopulations have also been reported [Steele, et al. 2014].
Another quantification of the differences between population samples is the allele sharing distance (ASD) [Gao and Martin 2009; Gao and Starmer 2007]. ASD is a pair-wise measure among subjects across a large set of markers, and is defined by the expression ASD = (1/L) Σ dl, where the sum runs over the L loci l = 1, …, L, and dl = 0 if two individuals have two alleles in common at the l-th locus, dl = 1 with one allele in common, and dl = 2 when there are no alleles in common. The relationship between ASD and the closely related identical by state (IBS) measure has been described by Miclaus et al. [Miclaus, et al. 2009].
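As a rough illustration of the ASD definition, the following Python sketch computes the distance for one pair of individuals. It assumes biallelic markers with genotypes coded as the count (0, 1, or 2) of a chosen allele, in which case the per-locus distance dl equals the absolute difference of the two genotype codes; the function name and the use of None for missing genotypes are illustrative choices, not part of any published implementation.

```python
def allele_sharing_distance(genotypes_1, genotypes_2):
    """Allele sharing distance between two individuals.

    Genotypes are coded as allele counts (0, 1, 2) at biallelic markers;
    None marks a missing genotype (assumption made for this sketch).
    With this coding, d_l = |g1 - g2|:
      two alleles in common -> 0, one in common -> 1, none in common -> 2.
    """
    total_distance = 0
    loci_used = 0
    for g1, g2 in zip(genotypes_1, genotypes_2):
        if g1 is None or g2 is None:
            continue  # skip loci where either genotype is missing
        total_distance += abs(g1 - g2)
        loci_used += 1
    if loci_used == 0:
        raise ValueError("no loci with genotypes for both individuals")
    return total_distance / loci_used


# Four loci: distances 0, 0, 2 and 1, so ASD = 3 / 4
print(allele_sharing_distance([0, 1, 2, 2], [0, 1, 0, 1]))
```

Computing this quantity for all pairs of subjects yields a distance matrix that can be passed to a clustering or scaling method.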
Admixture and Admixture Mapping
Although it simplifies the description of PS to imagine the allele frequencies of distinct subpopulations randomly drifting away from each other over time, populations also tend to mix. This is known as admixture. In the first generation after two distinct populations begin mixing, offspring derive half of their genetic material from each of the maternal and paternal populations. In subsequent generations, average ancestral proportions in offspring vary according to the composition and rates of genetic exchange among the ancestral populations.
African Americans are a classic example of this, where approximately 80% of the genome is derived from African ancestors and 20% from European ancestors at autosomes, and there are greater proportions of African-derived X chromosomes due to historically skewed transmission to offspring from European males and African females [Bryc, et al. 2010].
Examples of Population Stratification in Genetic Studies
As a simple numeric example of PS, suppose some data are collected as listed in Table 1. In population 1, the cell frequency (case, allele A) is 0.27, which is equal to the product of the marginal frequencies 0.3*0.9. This relationship also holds for population 2, i.e. 0.08 = 0.8*0.1. Therefore, no association exists between marker alleles and case-control status within either population. However, in the pooled data of populations 1 and 2, the cell frequency for (case, allele A), 0.175, is no longer equal to the product of the marginal frequencies 0.55*0.5, and a chi-square test with one degree of freedom shows a significant association with p-value < 0.0001. Therefore, even though there is no association in either population 1 or 2, a false positive association exists in the pooled population.
Table 1.
A numeric example of a false positive association due to population stratification.
| Population | Allele | Phenotype: Case | Phenotype: Control | Total | Association |
|---|---|---|---|---|---|
| 1 | A | 270 | 30 | 300 | No |
| 1 | B | 630 | 70 | 700 | |
| 1 | Total | 900 | 100 | 1000 | |
| 2 | A | 80 | 720 | 800 | No |
| 2 | B | 20 | 180 | 200 | |
| 2 | Total | 100 | 900 | 1000 | |
| Pooled | A | 350 | 750 | 1100 | Yes (P < .0001) |
| Pooled | B | 650 | 250 | 900 | |
| Pooled | Total | 1000 | 1000 | 2000 | |
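The counts in Table 1 can be checked with a short Python sketch. It assumes an uncorrected Pearson chi-square test on each 2×2 allele-by-phenotype table, and it obtains the one-degree-of-freedom p-value from the complementary error function so that only the standard library is needed; the function name is illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 df, no continuity correction).

    Table layout:          Case  Control
               allele A      a      b
               allele B      c      d
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # For 1 df, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


tables = {
    "population 1": (270, 30, 630, 70),
    "population 2": (80, 720, 20, 180),
    "pooled": (350, 750, 650, 250),
}
for label, counts in tables.items():
    statistic, p_value = chi_square_2x2(*counts)
    print(label, round(statistic, 2), p_value)
```

Within each population the statistic is exactly zero, while the pooled table gives a statistic of roughly 323, which illustrates how pooling two unassociated strata can produce a spurious signal.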
Confounding due to PS resulting in spurious genotype-phenotype associations is well-documented. A classic example is a study by Knowler et al. that describes an association between a polymorphism in the immunoglobulin Gm system, Gm3:5,13,14, and type 2 diabetes in Native Americans recruited from the Gila River Indian Community in southern Arizona [Knowler, et al. 1988]. Gm polymorphisms have different frequencies between ancestry groups [Brucato, et al. 2009; Schanfield and Kirk 1981; Williams, et al. 1985]. Knowler et al. showed that Gm3:5,13,14 was not a causal genetic factor in the development of type 2 diabetes, but that the observed association was confounded by admixture between Native American and European ancestry groups [Knowler, et al. 1988]. After adjustment for admixture proportions, the association was no longer statistically significant.
The spurious association of markers that are highly variable between ancestry groups is not uncommon. Choudhry et al. analyzed AIMs for association with asthma in two admixed Latino populations, Puerto Ricans and Mexicans, which have the highest and lowest asthma morbidity, mortality, and prevalence rates among all US populations, respectively [Choudhry, et al. 2006; Homa, et al. 2000; Moreno-Estrada, et al. 2013]. Of all 44 AIMs tested, eight were significantly associated with asthma, but only two remained significant after adjustment for PS.
Some populations have very complex recent demographic histories that must be accounted for in statistical analyses. For example, the Brazilian population is made up of individuals with varying proportions of African, Native American, and European ancestry [Pena, et al. 2011]. Skin color is poorly correlated with genetic ancestry in the Brazilian population and therefore self-reported race can be inaccurate for genetic studies [Pena, et al. 2011]. Early genetic studies of type 1 diabetes (T1D) in Brazilians reported geographic variability in HLA-DR and HLA-DQ allele frequencies, two genetic loci strongly associated with T1D in Europeans [Silva, et al. 2008; Thomson, et al. 2007]. In a study accounting for PS, Gomes et al. identified a novel protective haplotype DRB1*10-DQB1*0501 [Gomes, et al. 2017].
Accounting for PS in candidate gene studies is challenging due to the lack of genome-wide coverage of genetic factors from which ancestry may be inferred. A classic example is the observed association between a restriction fragment length polymorphism (RFLP) upstream of the insulin gene INS and T1D [Bell, et al. 1984]. Replication of this association was consistently reported in population-based studies across several ancestries, but no evidence of linkage was detected in family studies [Spielman, et al. 1989]. These findings initially suggested that the observed association was the result of confounding due to PS. However, implementation of the transmission disequilibrium test (TDT), a linkage method that incorporates family member controls and is robust to PS, detected strong evidence of linkage between the RFLP and T1D [Spielman, et al. 1993]. The failure of previous family-based genetic studies to detect linkage between the RFLP and T1D was likely due to a lack of power to detect variants with modest effects. Recent studies have shown that as few as 30 AIMs are sufficient to accurately estimate ancestry proportions in African American populations, suggesting that modest numbers of AIMs are adequate in more complex populations [Kodaman, et al. 2013; Ruiz-Narvaez, et al. 2011].
Patterns of PS may also provide insights into demographic histories in admixed populations. An analysis of 128 AIMs in the Cuban population showed a large European paternal contribution and a large Native American and African maternal contribution [Marcheco-Teruel, et al. 2014]. These contributions are concordant with the historical context of male European settlers mating with Native American females during the early stages of colonization, and later mating with African females during the period of transatlantic slave trade [Benn-Torres, et al. 2008; Mendizabal, et al. 2008]. Similarly, analysis of genetic data from 23andMe (23andMe Inc., Mountain View, CA), a direct-to-consumer genetic testing company, shows evidence of sex-biased gene flow in the U.S. reflective of early colonization by and subsequent immigration of European populations [Bryc, et al. 2015; Eriksson, et al. 2010; Tung, et al. 2011].
QUANTIFYING POPULATION STRATIFICATION
Global and Local Ancestry
Global ancestry
Many methods for working with PS estimate global parameters to summarize the ancestry of study subjects. These parameters are often useful for both PS detection and statistical control of confounding by PS. Depending on the questions addressed, methods to detect and quantify PS require genotype data ranging from a handful of carefully selected genetic variants to a large number of genome-wide SNPs. A common question in PS detection is the number and type of SNPs needed to detect PS in a given context. Regardless of the statistical methods employed, the more similar two populations are, the more markers need to be evaluated to detect the differences.
If the study is performing small-scale genotyping, then AIMs may be the most cost-efficient way to quantify ancestry. This approach is only possible if AIMs have been identified a priori, as has been done for the reference populations from the International HapMap Project (http://www.hapmap.org). The 1000 Genomes Project [Genomes Project, et al. 2012] is also widely used for most populations, with the Haplotype Reference Consortium (HRC) panel [McCarthy, et al. 2016] becoming more commonly utilized recently. However, construction of population-specific reference sets through whole genome sequencing is becoming increasingly common. Examples include French-Canadian [Low-Kam, et al. 2016]; Australian Aboriginal, exome [Tang, et al. 2016]; Japanese [Higasa, et al. 2016]; Persian Kuwaiti [Thareja, et al. 2015]; United Kingdom, UK10K [Huang, et al. 2015]; Japanese, 1KJPN [Kawai, et al. 2015]; South Asian Indian [Wong, et al. 2014]; Korean [Kim, et al. 2014]; Netherlands, GoNL [Deelen, et al. 2014]; Ashkenazi [Carmi, et al. 2014]; and Asian Malay [Wong, et al. 2013] reference sets, though some of these (UK10K and GoNL) have also been included in the HRC. Otherwise, if genome-wide association study (GWAS) data are available, then 50,000 to 100,000 linkage disequilibrium (LD)-pruned SNPs may be used to estimate global ancestry.
Local ancestry
With the availability of GWAS data in admixed populations and advances in admixture mapping, several methods have been developed to classify ancestry in small chromosomal regions. Early methods evaluating local ancestry in admixed populations, including MALDsoft, STRUCTURE, and ANCESTRYMAP, were based on Hidden Markov Models (HMM) [Falush, et al. 2003; Hoggart, et al. 2004; Montana and Hoggart 2007; Patterson, et al. 2004; Zhu, et al. 2006].
Local ancestry estimates describe the ancestral origin of chromosomes at a locus and can be used as covariates in linear models on a SNP-by-SNP basis [Wang, et al. 2011]. Alternatively, tests based on a conditional likelihood framework, which models the distribution of the test SNP given disease status and flanking marker genotypes, are also available [Wang, et al. 2011]. By contrast, principal components analysis (PCA), multidimensional scaling (MDS), STRUCTURE, and other methods provide estimates of global ancestry that are useful for adjusting for PS in linear models. While adjusting for global estimates is the most common approach and controls confounding in GWAS, residual confounding might lead to increased type II errors, and improvements in power have been noted for adjusting for local estimates [Wang, et al. 2011].
Global Ancestry Methods
Methods for estimating ancestry proportions
Direct evaluation of genetic ancestry proportions involves comparing sample data to reference allele frequencies and is based on the use of AIMs. The difference in allele frequency needed to differentiate two populations can vary depending on the number of AIMs and the genetic distance between populations. Often a 20% difference in allele frequencies is used to define AIMs. AIMs can be identified from published lists, or through empirical assessment of allele frequencies in the available genetic data. Fewer AIMs are required to quantify global ancestry when allele frequency differences are large. However, inclusion of more AIMs increases the precision of ancestry estimates. Accurate estimation of ancestry proportions is also dependent on the number of parental populations (designated K) assumed to contribute to the overall genetic ancestry of the target population. Most of the approaches described here can be implemented either by setting K to the number of subpopulations believed to be present in the data and providing those K reference datasets, or alternatively, by fitting increasing values of K and choosing the value of K where the likelihood of the data given K is largest. If the likelihood is largest at K=1, then there is no evidence for PS. Subjects who map into the known groups can then be identified as members of that population or a related subpopulation. Several software packages exist for computation of global genetic ancestry proportion estimates; among them, the most widely adopted are STRUCTURE [Falush, et al. 2003; Pritchard, et al. 2000a], FRAPPE [Tang, et al. 2006] and ADMIXTURE [Alexander and Lange 2011; Alexander, et al. 2009].
STRUCTURE uses a Bayesian approach and relies on a Markov Chain Monte Carlo (MCMC) algorithm to jointly sample the posterior distribution of allele frequencies and fractional group memberships. STRUCTURE assumes that the data are comprised of mosaics of chromosomes from an arbitrary number of homogeneous ancestral subpopulations (K), and that each subpopulation is characterized by a distinct vector of allele frequencies. STRUCTURE is sensitive to non-random missing data, and running the software with enough iterations to ensure the convergence of the MCMC algorithm is also of concern when utilizing this method [Yang, et al. 2005]. In practice, between 1,000 and 5,000 iterations are sufficient for burn-in, and 5,000–10,000 are sufficient for estimation. Too few iterations will cost accuracy, while too many iterations will only cost computational run-time, so if in doubt, using more iterations than necessary will still yield accurate results.
One of the earliest approaches for controlling for global ancestry was Structured Association (SA), a two-step procedure [Pritchard and Donnelly 2001; Pritchard, et al. 2000b]. The first step uses markers that are not associated with any trait of interest to assign individuals to subpopulations, and the second step tests for association within those groups. The first step can be performed using the program STRUCTURE. The second step, the association testing, is performed using likelihood ratio tests. The most challenging issue in SA analyses is estimating the correct number of subpopulations to condition on. STRUCTURE does provide a data-driven way to infer this number, by scanning over choices of K and choosing the value that maximizes the likelihood of the data given K.
FRAPPE and ADMIXTURE each use a maximum likelihood estimate (MLE) approach and optimize the likelihood for both allele frequencies and fractional group memberships using an expectation-maximization (EM) algorithm, but ADMIXTURE uses a faster optimization algorithm. ADMIXTURE yields ancestry estimates with accuracy similar to that of STRUCTURE but uses less computing time, and has many of the same capabilities, including the ability to estimate the number of underlying ancestral populations, incorporate reference individuals of known ancestry to improve ancestry estimates, and penalize small ancestry estimates to improve model parsimony and avoid model fitting problems [Alexander and Lange 2011].
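To make the likelihood-based idea concrete, the following Python sketch applies EM updates to a deliberately simplified version of the problem: the ancestry proportions of a single individual are estimated while the allele frequencies of the K parental populations are treated as known and fixed. This is an assumption made for illustration only; FRAPPE and ADMIXTURE also estimate the allele frequencies, and the function name, the frequency clamping constant, and the stopping rule below are hypothetical choices rather than features of either program.

```python
def estimate_ancestry(genotypes, ref_freqs, n_iter=200, tol=1e-8):
    """EM estimate of one individual's ancestry proportions.

    genotypes: allele counts (0, 1, 2) at L biallelic markers.
    ref_freqs: K lists of length L with the frequency of the counted
               allele in each parental population (assumed known).
    Returns a list of K proportions that sum to 1.
    """
    n_pops = len(ref_freqs)
    n_loci = len(genotypes)
    eps = 1e-6  # keep frequencies away from 0 and 1 to avoid division by zero
    freqs = [[min(max(f, eps), 1.0 - eps) for f in pop] for pop in ref_freqs]

    q = [1.0 / n_pops] * n_pops  # start from equal proportions
    for _ in range(n_iter):
        new_q = [0.0] * n_pops
        for l, g in enumerate(genotypes):
            # Probability that a random allele copy is the counted allele
            p_counted = sum(q[k] * freqs[k][l] for k in range(n_pops))
            p_other = sum(q[k] * (1.0 - freqs[k][l]) for k in range(n_pops))
            for k in range(n_pops):
                # Expected number of this individual's allele copies at
                # locus l that derive from population k
                new_q[k] += g * q[k] * freqs[k][l] / p_counted
                new_q[k] += (2 - g) * q[k] * (1.0 - freqs[k][l]) / p_other
        new_q = [x / (2.0 * n_loci) for x in new_q]
        change = max(abs(a - b) for a, b in zip(q, new_q))
        q = new_q
        if change < tol:
            break
    return q


# Ten fully diagnostic markers; 16 of 20 allele copies match population 1
genotypes = [2, 2, 2, 2, 2, 2, 1, 1, 1, 1]
ref_freqs = [[1.0] * 10, [0.0] * 10]
print([round(x, 3) for x in estimate_ancestry(genotypes, ref_freqs)])
```

In this toy example the markers are fully diagnostic, so the estimated proportions are close to the fraction of allele copies matching each population (about 0.8 and 0.2); real AIM panels have smaller frequency differences and correspondingly wider uncertainty.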
Newer software packages for calculating global ancestry proportion have been constructed to take sequencing-derived genotypes with uncertainty into account, as well as to construct population relationship trees from the data. NGSadmix [Skotte, et al. 2013] is an extension of the MLE framework built to accommodate genotype likelihoods often available from low-depth next-generation sequencing (NGS) data due to the uncertainty regarding the true genotypes. Although slower than ADMIXTURE, use of the genotype likelihoods outperform the hard-called genotypes, when sequencing depth was approximately even for all individuals and average depth was at least 0.5x. Ohana [Cheng, et al. 2017] is another new method for inferring admixture in an MLE framework which is applicable both to called genotypes and to NGS data, which also estimates population relationships using a Gaussian approximation. It selects the best covariance matrix compatible with a tree, thereby estimating a tree, and provides simple algorithms and visualization tools to obtain the evolutionary trees.
There are also software packages that explicitly map spatial differences in ancestry. These methods were developed to approximate the geographic location of admixed populations. SPA [Yang, et al. 2012] is a probabilistic model for the spatial structure of genetic variation, which explicitly models how the allele frequency of each SNP changes as a function of the location of the individual in geographic space (where the allele frequency is a function of the x and y coordinates of an individual on a map). This approach detects SNPs with steep geographic gradients in allele frequency that suggest the SNPs have been under selection. Geographic Ancestry Positioning (GAP) uses genotypes to infer local spatial distances and applies them to a global space [Bhaskar, et al. 2017]. This method has been extended to an association test, which uses an allele frequency smoothing technique using the spatial coordinates and incorporates that information to test each SNP in an inverse regression of the genotype against the trait, conditional on the estimated allele frequency.
Methods for observing and clustering ancestral groups
Several methods, such as MDS and PCA, infer PS using singular value or eigenvector decomposition. These methods use genome-wide data to summarize genetic variance in variables that can be plotted to visualize genetic relationships between samples. When samples are visualized alongside known reference populations, their ancestral background, and therefore PS, can be identified. Other scenarios can also cause clustering of samples, including kinship and genotyping batch effects. Thus, visualization of the components with reference population anchors is strongly recommended to ensure that clustering by continental ancestry in reference populations is as expected. PCA is implemented in EIGENSTRAT [Price, et al. 2006], and MDS is implemented in PLINK [Purcell, et al. 2007]. The output from PCA and MDS is often very similar, as illustrated in Figure 1.
Figure 1. Principal components analysis.
These figures show the clustering results using principal components analysis implemented by the Eigensoft v3.0 software with 142,616 genome-wide random autosomal SNP loci from the HapMap project (Phase 3, release 3). Only the first three eigenvectors are shown.
Note: CEU, Utah residents with ancestry from northern and western Europe; YRI, Yoruba in Ibadan, Nigeria (West Africa); CHB, Han Chinese in Beijing, China; ASW, African ancestry in Southwest USA. ASW is an admixed population.
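As a minimal sketch of the PCA approach described in this section, the following Python code extracts the first principal component of a small genotype matrix. It assumes each SNP is centered and scaled by sqrt(2p(1−p)), and it substitutes plain power iteration for the full eigen-decomposition used by dedicated software; the simulated two-population data, function names, and parameter values are all illustrative.

```python
import random


def standardize(genotypes):
    """Center each SNP and scale by sqrt(2p(1-p)); drop monomorphic SNPs.

    genotypes: list of individuals, each a list of allele counts (0, 1, 2).
    Returns a list of SNP columns, each of length n_individuals.
    """
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        col = [row[j] for row in genotypes]
        p = sum(col) / (2.0 * n)
        if p <= 0.0 or p >= 1.0:
            continue
        scale = (2.0 * p * (1.0 - p)) ** 0.5
        columns.append([(g - 2.0 * p) / scale for g in col])
    return columns


def top_pc(genotypes, n_iter=100):
    """First principal component (one unit-length score per individual)."""
    cols = standardize(genotypes)
    if not cols:
        raise ValueError("no polymorphic SNPs available")
    n = len(genotypes)
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    for _ in range(n_iter):
        # Multiply v by X X^T one SNP column at a time
        w = [0.0] * n
        for col in cols:
            dot = sum(c * x for c, x in zip(col, v))
            for i in range(n):
                w[i] += col[i] * dot
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def simulate_two_populations(n_per_pop, freqs_a, freqs_b, seed=1):
    """Draw genotypes for two populations with the given allele frequencies."""
    rng = random.Random(seed)

    def draw(freqs):
        return [sum(rng.random() < p for _ in range(2)) for p in freqs]

    return ([draw(freqs_a) for _ in range(n_per_pop)]
            + [draw(freqs_b) for _ in range(n_per_pop)])


freqs_a = [0.1] * 50 + [0.9] * 50
freqs_b = [0.9] * 50 + [0.1] * 50
data = simulate_two_populations(10, freqs_a, freqs_b)
pc1 = top_pc(data, n_iter=50)
print([round(x, 2) for x in pc1])
```

With strongly differentiated allele frequencies, the scores of the two simulated groups take opposite signs on the first component, which is the kind of separation that reference population anchors make visible in a plot.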
Recently, several faster methods for computation of PCs have been developed that use randomized matrix algorithms and parallelized matrix multiplication. These methods include FlashPCA [Abraham and Inouye 2014] and FastPCA [Galinsky, et al. 2016], the latter implemented in EIGENSOFT as the fastmode option. FastPCA computational time scaled linearly with increasing sample size, as opposed to other methods that have shown cubic or quadratic increases. Analysis of 100,000 individuals and 100,000 SNPs with FastPCA on a single computer required less than an hour and only 3.2GB memory, while FlashPCA required nearly 10 hours and 40GB to compute across 30,000 samples. The LASER program [Wang, et al. 2014; Wang, et al. 2015] is designed to handle low-coverage sequence reads to perform PCA. When combined with genotype imputation, LASER 2.0 can accurately estimate fine-scale genetic ancestry, and is implemented on a web server (http://laser.sph.umich.edu/) [Taliun, et al. 2017].
The SNPweights method [Chen, et al. 2013] assigns weights to the individual SNPs in the analysis by population. Weights for SNPs are pre-computed in the reference panel, and those weights can be applied to the sample to infer ancestry without having to gain access to the raw genotypes of the reference panel. This is similar to the approaches utilizing AIMs, but it incorporates more SNPs, which improves accuracy when genome-wide SNPs are available.
Principal Components Analysis in Related samples (PC-AiR) [Conomos, et al. 2015] infers PS in the presence of related individuals. This method identifies an unrelated subset of individuals that represents the ancestral diversity of the sample, computes PCs in this subset, and projects the PCs onto the remainder of the sample. This approach does not require reference samples to be included for adequate performance, but it performs better when kinship coefficients are incorporated in cases where the pedigree structure among samples is unknown.
Extensions of PCA that handle complex ancestral scenarios are also available. PCAmask [Moreno-Estrada, et al. 2013] and subspace PCA (ssPCA) [Johnson, et al. 2011] were developed to address the complex recent admixture of indigenous and Native American populations. These approaches analyze genomic segments consistent with a single inferred continental population (virtual genomes). The PCAmask approach extends ssPCA by utilizing phased haplotype data, allowing use of genomic regions that are ancestrally heterozygous.
Genomic Control
Another popular PS method is Genomic control (GC), which controls the inflation of test statistics and can also be used to detect PS. It was developed for dichotomous traits [Devlin, et al. 2004; Devlin and Roeder 1999; Devlin, et al. 2001] and then extended to quantitative traits [Bacanu, et al. 2002]. At least 100 uncorrelated SNPs should be genotyped for GC, and these SNPs should not be associated with the trait of interest. The goal of GC is to quantify the bias in the data, either due to confounding, experimental errors, cryptic relatedness, or other causes. When SNPs in the GC set are associated with the trait, then their test statistics represent the alternative hypothesis and appear biased compared to the distribution expected under the null hypothesis. Thereby, if non-null SNPs are used to calculate the GC correction, then the correction will be conservative and associations will be more difficult to detect.
GC adjusts the observed distribution of the test statistic Y for tests of association between these null markers and the trait. Under the null hypothesis of no association, the Armitage trend test statistic for association of SNPs with traits asymptotically follows a chi-square distribution with one degree of freedom (χ²₁). When there is PS, the test statistic is inflated by a factor, λ. Therefore, the statistic (Y) results from the inflation of the Armitage trend test, which can be written as Y ∼ λχ²₁. λ is then estimated as λ̂ = median(Y1, Y2, …, YL)/0.4549 or λ̂ = mean(Y1, Y2, …, YL)/1, since the median and mean of χ²₁ are 0.4549 and 1, respectively. By estimating λ from the L unassociated SNPs and using Yi/λ̂ in place of Yi to calculate the p-value for each marker i, the effect of PS on p-values will be removed, reestablishing the expected null distribution over a large number of SNPs.
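The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by any particular package: the function names and the example statistics are hypothetical, the median-based estimator is used, and values of λ̂ below 1 are set to 1 before correction, which is a common convention that we assume here.

```python
import math
from statistics import median

CHI2_1DF_MEDIAN = 0.4549  # median of a chi-square distribution with 1 df


def gc_lambda(null_stats):
    """Estimate lambda from 1-df chi-square statistics of null markers."""
    return median(null_stats) / CHI2_1DF_MEDIAN


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a 1-df chi-square statistic (standard library only)."""
    return math.erfc(math.sqrt(stat / 2.0))


def gc_correct(stats, lam):
    """Divide each statistic by lambda; lambda below 1 is treated as 1 (assumed convention)."""
    lam = max(lam, 1.0)
    return [s / lam for s in stats]


# Hypothetical statistics for a handful of null markers; a real analysis
# would use at least 100 uncorrelated, unassociated SNPs.
null_stats = [0.10, 0.35, 0.60, 0.91, 1.40, 2.80, 0.05]
lam_hat = gc_lambda(null_stats)

# Effect of the correction on one hypothetical test statistic of interest.
adjusted = gc_correct([10.0], lam_hat)
p_raw = chi2_1df_pvalue(10.0)
p_adj = chi2_1df_pvalue(adjusted[0])
```

Because every statistic is divided by the same constant, the ranking of markers is unchanged; only the p-values shift toward the null.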
Additionally, λ̂ provides a convenient quality control statistic for assessing whether association tests for GWAS data are confounded. This is done by checking if λ̂ is much different from 1 in the lower 90% of ranked test statistics, where smaller p-values are excluded to avoid apparent inflation due to true associations of SNPs with traits. Values of λ̂ < 1.05 are considered benign, and values of 1.2 or more are tolerated for very large studies of highly polygenic traits with sample sizes of hundreds of thousands of participants. Large values of λ̂ indicate that tests of association are confounded by some phenomenon, which may include PS.
Additionally, other types of systematic differences between the data from groups of subjects can also cause large λ̂, such as nonrandom genotyping error that might arise due to merging GWAS data from different genotyping experiments, nonrandom differences in DNA quality between study samples, or other unmeasured confounders. This procedure is implemented in PLINK and is straightforward to calculate with any statistical software [Purcell, et al. 2007].
GC is reported to be ineffective if too few loci (< 100) are used and may decrease power if too many loci (> 500) are used [Marchini, et al. 2004]. Recent GWAS usually use λ calculated from genome-wide SNPs as an important post-analysis diagnostic statistic and to protect against excess type I error. Estimates of λ vary substantially with the set of markers chosen, and GC may also decrease power if PS is extreme [Kohler and Bickeboller 2006; Zhang, et al. 2008]. GC can also be conservative if AIMs are used instead of random markers [Epstein, et al. 2007].
An alternative approach, GCF, which is a modified version of GC that uses the F distribution, has been shown to be more appropriate than GC in some extreme examples of PS [Dadd, et al. 2009; Devlin, et al. 2004]. GC also does not correct effect size estimates, although it can be used to correct confidence intervals, and as a result odds ratios or linear regression coefficients will be unreliable after GC is applied, even though the test statistics and p-values have been adjusted.
In scenarios where meta-analysis is being performed across several GWAS, GC corrections can be performed within each study and then again in the meta-analysis results. This double GC correction adjusts the set of test statistics across all markers within each study by a GC inflation factor. It then calculates a combined statistic across studies at each marker, and adjusts all combined statistics across the genome by the corresponding GC inflation factor. It has been suggested that PCA correction is more effective than the double GC correction in meta-analysis [Wang, et al. 2012b]. In the case where population stratification exists, using the double GC method usually results in much lower power than using the PCA correction in meta-analysis, even when the causal SNP does not have significant allele frequency differentiation in the subpopulations.
LD Score Regression
LD Score Regression [Bulik-Sullivan, et al. 2015] is a method utilizing summary association statistics from a GWAS to determine whether inflation of the test statistics is due to a true polygenic signal or bias. LD scores are computed in a sequenced reference panel with LD structure similar to that of the GWAS sample by calculating the strength of tagging by SNPs within a 1 cM window. LD score regression can be used to estimate the mean contribution of confounding bias to the inflation in the test statistics; i.e. to indicate post-hoc whether there is residual cryptic relatedness or population stratification remaining in the dataset. However, the model assumes that there is no systematic correlation between Fst and LD Score, which may not be the case when there is selection. It was demonstrated that the average LD Score regression intercept was approximately equal to λ in simulations with PS. Because λ increases with sample size in the presence of polygenicity, the gain in power obtained by correcting test statistics with the LD Score regression intercept instead of λ will become even more substantial for larger GWAS.
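The idea of separating polygenic signal from confounding can be illustrated by regressing chi-square statistics on LD scores and reading off the intercept. The sketch below is a deliberately simplified, unweighted least-squares fit on made-up numbers; the published method uses weighted regression and resampling-based standard errors, which are omitted here, and the function name is hypothetical.

```python
def ols_intercept_slope(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical LD scores and chi-square statistics, constructed to follow
# chi2 = 1.05 + 0.02 * ld_score exactly so the fit is easy to verify.
ld_scores = [10.0, 50.0, 100.0, 150.0, 200.0]
chi2_stats = [1.05 + 0.02 * score for score in ld_scores]

intercept, slope = ols_intercept_slope(ld_scores, chi2_stats)
mean_chi2 = sum(chi2_stats) / len(chi2_stats)
```

In this toy example the mean chi-square statistic is well above 1, but most of that inflation tracks LD score (the slope, reflecting polygenic signal), while the intercept stays close to 1, which would suggest little confounding.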
Subtle Stratification (PC Loading regression)
An approach for correcting residual inflation of test statistics is PC loading regression [Bhatia, et al. 2016], which integrates the concept of weighting SNPs according to their contribution to PCs (i.e. total genetic variance) and also incorporates rare variant haplotypes. These rare variants are often omitted from PCA during the LD-pruning process. The slope of this PC loading regression provides an estimate of the magnitude of PS. It has been suggested that rare haplotypes can better capture subtle PS [Bhatia, et al. 2016], a concept which has been leveraged in several fundamental approaches in human genomics (i.e. rare variant enrichment in extended pedigrees [Browning and Thompson 2012], the continued relevance of linkage analysis [Ott, et al. 2015; Teare and Santibanez Koref 2014], and haplotype length investigations for selection [Lappalainen, et al. 2010]).
Cryptic relatedness and population stratification
As genetic studies have grown larger over time, so has the likelihood of recruiting related individuals or those who share extended relationships unbeknownst to the investigators. These cryptic relationships can also influence association statistics much like PS. The relatedness between two individuals is most frequently expressed in terms of the probability that they share zero, one or two alleles that are inherited identical-by-descent (IBD). However, as sample sizes have increased, construction of IBD matrices has become more computationally intensive, and determining whether to use relatedness as an exclusion criterion or to model it explicitly has been a subject of much debate. While PCA may detect and account for some relatedness, it may not adequately control for it, or it may lose power relative to methods that directly account for these relationships. Use of mixed models to account for cryptic relatedness has been one popular strategy for retaining as many samples as possible, as has reconstruction of pedigrees using IBD information. Mixed models have been shown to have better overall performance than PCA in the presence of association [Wang, et al. 2013].
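As a small worked example of the IBD representation, the three sharing probabilities can be collapsed into a single kinship coefficient, 0.25 × P(IBD = 1) + 0.5 × P(IBD = 2). The sketch below applies this standard identity to the expected sharing probabilities for a few textbook relationships; the function and variable names are hypothetical.

```python
def kinship_from_ibd(k0, k1, k2):
    """Kinship coefficient from the probabilities of sharing 0, 1, or 2 alleles IBD."""
    if abs(k0 + k1 + k2 - 1.0) > 1e-9:
        raise ValueError("IBD probabilities must sum to 1")
    return 0.25 * k1 + 0.5 * k2


# Expected IBD sharing probabilities (k0, k1, k2) for outbred relative pairs.
expected_ibd = {
    "parent-offspring": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "half siblings": (0.5, 0.5, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}

kinship = {name: kinship_from_ibd(*probs) for name, probs in expected_ibd.items()}
```

Parent-offspring and full-sibling pairs share the same kinship coefficient but differ in their IBD probabilities, which is one reason the full three-probability representation is often preferred when identifying cryptic relationships.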
PS can be thought of as a special case of cryptic relatedness, where participants who share parental ancestral populations are more closely related to each other than they are to participants who arise from different populations. In that conceptual framework for PS, all participants in the study are connected by a large latent pedigree, with the ancestors that connect them unobserved. The number of meioses that separate closely related people is small, is larger for distantly related people from the same population, and is much larger for pairs of people from distinct continental populations with long coalescence times. A group of methods designed to leverage this property of genetic data for quantitative traits are the linear mixed models, which can mitigate both the issues that arise when there is cryptic relatedness [Voight and Pritchard 2005] and PS in association studies in one procedure [Kang, et al. 2010; Kang, et al. 2008; Listgarten, et al. 2012; Zhou and Stephens 2012]. These approaches were originally developed for model organism studies in multiple inbred and outbred lines, where many spurious results were initially observed but were then non-significant after application of the mixed model methods. In a recent review of mixed model analyses, Martin and Eskin describe the formulation of these analyses, and show that mixed models produce smaller λ̂ statistics than PCA correction or removal of related subjects for a range of quantitative traits in a structured population from Finland [Martin and Eskin 2017].
In addition to confounding genotype-phenotype associations, PS may also distort estimates of trait heritability. Heritability for a particular trait may be described as the proportion of trait variability explained by genetic variants. Though historically measured in family studies, newer methods have been developed to estimate heritability from genome-wide data in population-based studies [Lee, et al. 2011; Yang, et al. 2010; Yang, et al. 2011]. However, Dandine-Roulland et al. warn that model adjustment for ancestry inferred from genomic data does not adequately correct for PS bias in population-based heritability estimates [Dandine-Roulland, et al. 2016].
Local Ancestry Methods
Local ancestry methods are used to identify ancestral origins of chromosomal regions. In two-way admixture, such as in African Americans, any given genetic locus will have exactly 0%, 50% or 100% European-derived alleles, corresponding to 0, 1, or 2 copies. Accurate inference of local ancestry depends on the number of generations since the admixture event, the number of admixture events, the number of ancestral populations involved across admixture events, and the availability of reference data that represent the ancestral populations. Methods for local ancestry inference may be divided into two broad classes: 1) methods that do not model linkage disequilibrium (LD) and 2) methods that leverage LD. Key points regarding each software package discussed here are summarized in Table 2.
Table 2.
Local ancestry inference software
| Program | Framework | Models LD/Haplotype | Ancestral Population Input Format | Three-way admixture | Closely Related Populations* | Link for download |
|---|---|---|---|---|---|---|
| MaLDsoft/STRUCTURE | HMM-MCMC | No | - | Yes, but higher error rates | No | https://web.stanford.edu/group/pritchardlab/structure.html |
| ADMIXMAP | HMM-MCMC | No | Ancestral Allele Frequencies | Yes, but higher error rates | No | http://homepages.ed.ac.uk/pmckeigu/admixmap/ |
| ANCESTRYMAP | HMM-MCMC | No | Ancestral Allele Frequencies | Two-way admixture | No | https://reich.hms.harvard.edu/software |
| ADMIXPROGRAM | HMM-ML | No | Ancestral Allele Frequencies | Two-way admixture | No | Available on request from Authors |
| SABER | MHMM | Yes, diplotypes | Phased Reference Populations | Two-way admixture | No | http://med.stanford.edu/tanglab/software/saber.html |
| HAPAA | Hierarchical HMM | Yes | Phased Reference Populations | Two-way admixture | No | http://hapaa.stanford.edu |
| HAPMIX | MHMM-MCMC | Yes | Phased Reference Populations | Two-way admixture | No | https://reich.hms.harvard.edu/software |
| LAMP/LAMP-ANC | Sliding Window, ICM | No | Not Required/Ancestral Allele Frequencies | Two-way admixture | No | http://lamp.icsi.berkeley.edu/lamp/ |
| WINPOP | Adaptive Sliding Window | No | Ancestral Allele Frequencies | Two-way admixture | Yes | http://bogdan.bioinformatics.ucla.edu/software/lamp/ |
| LAMP-LD/LAMP-HAP | Window + HMM | Yes | Phased Reference Populations | Three-way admixture | Yes | http://bogdan.bioinformatics.ucla.edu/software/lamp/ |
| RFMix | Random Forest | Yes | Phased Reference Populations | Three-way admixture | Yes | https://sites.google.com/site/rfmixlocalancestryinference/ |
HMM = Hidden Markov Model; MCMC = Markov Chain Monte Carlo; ML = Maximum Likelihood; MHMM = Markov Hidden Markov Model; ICM = Iterated Conditional Modes.
* Infers ancestry accurately for closely related populations such as CHB-JPT.
Methods that do not model LD
Early methods for local ancestry inference, including STRUCTURE/MaLDsoft, ADMIXMAP, ANCESTRYMAP, and ADMIXPROGRAM, are based on variations of first-level HMMs, where the goal is to make inferences on a series of hidden states (local ancestry) based on observable states (alleles and allele frequencies from ancestral populations) [Falush, et al. 2003; Hoggart, et al. 2004; McKeigue 1998; Montana and Hoggart 2007; Patterson, et al. 2004; Zhu, et al. 2006]. A key assumption of these HMMs is that the observed states, or alleles, are independent of each other, conditional upon the hidden states, the ancestry source for each allele. For this reason, these methods rely on unlinked AIMs. These early methods are able to infer continental ancestry throughout the genome (African and European), with the resolution limited by the number of independent AIMs available and computational tractability.
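The HMM formulation described above can be made concrete with a small Viterbi decoder. This is a teaching sketch under strong assumptions, not the algorithm of any program listed in Table 2: it treats a single haploid chromosome, uses a fixed ancestry-switch probability between adjacent markers, takes ancestral allele frequencies as known and strictly between 0 and 1, and uses hypothetical function names and data.

```python
import math


def viterbi_local_ancestry(alleles, freqs, switch_prob=0.05, prior=(0.5, 0.5)):
    """Most likely ancestry path for one haploid chromosome.

    alleles: list of 0/1 alleles, one per ancestry informative marker.
    freqs:   freqs[k][m] is the frequency of allele 1 at marker m in
             ancestral population k (must lie strictly between 0 and 1).
    """
    n_pop = len(freqs)

    def log_emit(k, m):
        # Alleles are independent given the hidden ancestry state.
        p = freqs[k][m]
        return math.log(p if alleles[m] == 1 else 1.0 - p)

    stay = math.log(1.0 - switch_prob)
    move = math.log(switch_prob / (n_pop - 1))

    score = [math.log(prior[k]) + log_emit(k, 0) for k in range(n_pop)]
    back = []
    for m in range(1, len(alleles)):
        new_score, pointers = [], []
        for k in range(n_pop):
            best_j = max(range(n_pop),
                         key=lambda j: score[j] + (stay if j == k else move))
            pointers.append(best_j)
            new_score.append(score[best_j]
                             + (stay if best_j == k else move)
                             + log_emit(k, m))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_pop), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical AIMs: allele 1 is common in population 0 and rare in population 1.
freqs = [[0.9] * 6, [0.1] * 6]
switch_path = viterbi_local_ancestry([1, 1, 1, 0, 0, 0], freqs)
noisy_path = viterbi_local_ancestry([1, 1, 0, 1, 1, 1], freqs)
```

In the first example the decoder places one ancestry switch in the middle of the chromosome; in the second, a single discordant allele is not enough to justify two switches, so the whole segment is assigned to population 0.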
The Local Ancestry in admixed Populations (LAMP) method uses sliding windows of contiguous independent SNPs to infer local ancestry [Sankararaman, et al. 2008]. It first calculates an optimal window length such that the probability that a given window has a recombination event is small, and assumes that alleles in the window are derived from only one ancestry. It then uses a clustering algorithm known as Iterated Conditional Modes (ICM) on each of these windows to infer ancestry for each marker in the window, followed by a majority vote across overlapping windows to call ancestry. Advantages of this method over previous methods include faster run times, capability of handling GWAS data, improved accuracy, ability to infer local ancestry even in the absence of ancestral reference data, and incorporation of ancestral allele-frequency data (LAMP-ANC), when available, for even more accurate predictions. These methods are optimized to make ancestry calls for admixed populations with two-way admixture between distant ancestral populations such as Africans and Europeans, and they are inaccurate if inferences are made on closely related populations.
WINPOP modifies and extends the LAMP method to provide inference of local ancestry not only in admixed individuals with distant ancestry, but also between closely related populations [Pasaniuc, et al. 2009]. It uses a sliding window like LAMP with two important distinctions: 1) it adaptively determines window length for each location and 2) it allows for up to one recombination event to occur within each window. The method provides more accurate results than LAMP and than LD-based methods such as SABER and HAPAA (discussed below) across distant and closely related two-way admixtures [Sankararaman, et al. 2008; Sundquist, et al. 2008; Tang, et al. 2006]. The greatest gains in accuracy were reported to occur in closely related populations. Despite this improvement, the method achieves at most 91% accuracy for closely related populations.
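The window-and-vote logic can be sketched as follows. Note the substitution: instead of ICM clustering, each window is classified by its likelihood under known ancestral allele frequencies (closer in spirit to the LAMP-ANC setting), and the window length is fixed rather than computed from recombination parameters. The function names, defaults, and data are hypothetical, and a single haploid chromosome is assumed.

```python
import math
from collections import Counter


def window_ancestry(alleles, freqs, start, end):
    """Ancestry with the highest log-likelihood for markers start..end-1."""
    def loglik(k):
        total = 0.0
        for m in range(start, end):
            p = freqs[k][m]
            total += math.log(p if alleles[m] == 1 else 1.0 - p)
        return total
    return max(range(len(freqs)), key=loglik)


def sliding_window_calls(alleles, freqs, window=3, step=1):
    """Call ancestry per marker by majority vote over overlapping windows."""
    n = len(alleles)
    votes = [Counter() for _ in range(n)]
    last_start = max(n - window, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the final markers are covered
    for start in starts:
        end = min(start + window, n)
        call = window_ancestry(alleles, freqs, start, end)
        for m in range(start, end):
            votes[m][call] += 1
    # Ties are resolved in favor of the ancestry voted for first.
    return [v.most_common(1)[0][0] for v in votes]


# Hypothetical markers: allele 1 is common in population 0, rare in population 1.
freqs = [[0.9] * 8, [0.1] * 8]
calls = sliding_window_calls([1, 1, 1, 1, 0, 0, 0, 0], freqs)
noisy_calls = sliding_window_calls([1, 0, 1, 1, 1, 1, 1, 1], freqs)
```

Each window is assumed to carry a single ancestry, so a window that straddles a true breakpoint is called for whichever ancestry dominates it; the vote across overlapping windows is what localizes the switch.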
Methods that model LD
LD-based methods for local ancestry inference assume that there may be haplotypes unique to a given population. SABER is one of these methods and uses a ‘Markov-switching model,’ also known as a Markov Hidden Markov Model (MHMM) [Tang, et al. 2006]. Previous HMMs were incapable of handling LD between markers, as modeling haplotypes within ancestral populations in the HMM framework would be computationally intractable. A similar approach, HMM-based Analysis of Polymorphisms in Admixed Ancestries (HAPAA), uses hierarchical HMMs to model LD, displays lower error rates than SABER, and also has features that evaluate the effect of genetic divergence between ancestral populations and time-to-admixture [Sundquist, et al. 2008].
HAPMIX is a haplotype-based HMM method that achieves high accuracy and has a strict assumption of two ancestral populations [Price, et al. 2009]. It utilizes the population genetic model of Li and Stephens and phased haplotypes from unadmixed ancestral populations as references to infer local ancestry [Li and Stephens 2003]. Like HAPAA, HAPMIX uses an HMM to explicitly model LD when inferring local ancestry, but with a few key differences. It allows some margin of error for miscopying ancestry segments from the wrong population. It also allows for unphased data for the admixed population and attempts to account for phase-flip errors on ancestry inference. These features, along with the use of dense SNPs, allow it to make inference on smaller stretches of chromosome, which is where ancient admixture is likely to be detected. However, the requirement for phased haplotypes from unadmixed ancestral populations and the specification of many parameters limit its use in less-studied populations.
LAMP-LD improves on existing methods by proposing a model of local ancestry inference that extends to multi-way admixed populations with significantly reduced error rates [Baran, et al. 2012]. Like HAPAA and HAPMIX it models the haplotype structure using HMMs, but with a fixed-size state-space. This is the only method that uses a fast-approximation of the Li and Stephens model to realize ancestral haplotype structures. Additionally, LAMP-LD estimates its parameters from the reference haplotypes rather than relying on user-specification which greatly reduces parameter misspecification. Furthermore, it combines the window-based method originally developed for LAMP with an HMM method that relaxes the no-recombination limitation, improving speed and accuracy in three-way admixed populations. Another extension to LAMP-LD is LAMP-HAP, which further leverages pedigree information from trio data to provide local ancestry estimates with greater accuracy.
RFMix departs from the HMM-extension framework discussed above, using a discriminative approach that explicitly models ancestry along an admixed chromosome given known reference haplotypes or inferred ancestry [Maples, et al. 2013]. In the inference algorithm for RFMix, phased reference chromosomes are first divided into windows of equal size by genetic distance. A random forest is then trained within each window to classify ancestry. The random forest is then applied to the corresponding window of the admixed chromosome to generate fractional votes, which are then summed to generate posterior probabilities for ancestry within each window. Posterior probabilities from consecutive windows are then put through max marginalization of the forward-backward posterior probabilities to infer the most likely sequence of ancestry across windows. The method is faster than LAMP-LD or LAMP-HAP and provides more accurate estimates of local ancestry. An important feature of this software is that it is accurate even when reference data are limited. The algorithm is also able to incorporate inferred ancestry segments from the admixed chromosome to further augment the training set in an iterative process.
Admixture mapping
Admixture mapping can be used to identify disease causing loci in admixed populations. Admixture mapping is an ideal approach for studying diseases with differential prevalence across ancestral populations where the disparity is heritable. Methods for admixture mapping have been covered extensively in the following review articles [Seldin, et al. 2011; Shriner 2013; Winkler, et al. 2010]. Briefly, case-only and case-control admixture mapping strategies have been widely used in the past. While case-only admixture mapping strategies can provide greater sensitivity in detecting disease loci, they are also particularly prone to false-positive signals. Case-control admixture mapping strategies provide a stronger control of type I error. Both case-only and case-control admixture strategies have advantages over GWAS for multiple testing. Because ancestral LD blocks tend to be much longer than short-range LD, the number of independent tests is drastically reduced with admixture mapping.
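The intuition behind a case-only scan can be shown with a toy statistic: at a given locus, compare each case's local ancestry proportion with that individual's genome-wide ancestry and ask whether the average deviation differs from zero. The sketch below is only an illustration of this idea with a simple one-sample z-statistic; it is not the test statistic of any published admixture mapping program, and the function name and data are hypothetical.

```python
import math
from statistics import mean, stdev


def case_only_z(local_copies, global_ancestry):
    """Z-statistic for excess local ancestry among cases at one locus.

    local_copies:    copies (0, 1, or 2) of alleles from the ancestral
                     population of interest at the locus, one per case.
    global_ancestry: genome-wide proportion of that ancestry, one per case.
    """
    diffs = [c / 2.0 - g for c, g in zip(local_copies, global_ancestry)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical data for eight cases.
local_copies = [2, 2, 1, 2, 1, 2, 2, 1]
global_ancestry = [0.80, 0.75, 0.70, 0.85, 0.60, 0.80, 0.90, 0.65]
z = case_only_z(local_copies, global_ancestry)

# Re-expressing the same data in terms of the other ancestral population
# flips the sign of the statistic but not its magnitude.
z_other = case_only_z([2 - c for c in local_copies],
                      [1.0 - g for g in global_ancestry])
```

A positive value indicates that cases carry more of the ancestry of interest at the locus than expected from their genome-wide average. Because the comparison is made within cases only, any systematic error in local ancestry calls affects the statistic directly, which is consistent with the false-positive concern noted above.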
In addition to the traditional admixture mapping strategies, at least two joint test frameworks leverage local ancestry to increase power for gene discovery. The first joint method is implemented in Mixscore, which combines a case-only admixture test statistic with a SNP association test into a single one-degree of freedom chi-square test [Pasaniuc, et al. 2011]. This test is more powerful for discovery than the Armitage trend test, the SNP association test while conditioning on local ancestry, case-only association test, and also the two-degree of freedom chi-square joint-test for the sum of SNP and admixture mapping association test.
Another joint test, BMIX, uses a Bayesian framework in which posterior probabilities from admixture mapping serve as prior probabilities for association testing, reducing the multiple testing burden [Shriner, et al. 2011]. In simulations, the authors show BMIX to be more powerful than the Mixscore approach.
STRATEGY
A summary of most methods described in this article is presented in Table 2, with an outline of capabilities and limitations for each approach.
Investigations of genetic traits in humans are observational studies where researchers do not perform mating experiments, control the environment for the organism, or induce mutations such as in experimental studies with model organisms. As a result of the observational nature of the research, as for any epidemiologic investigation, care must be taken when planning the ascertainment of subjects and the statistical analysis of the genetic data to control for confounders.
One of the most common and important considerations regarding potential confounding in human genetic epidemiology is whether a sample of subjects under study includes persons of mixed ancestry or groups of subjects with distinct ancestral backgrounds. When there is a difference across ancestral groups in the probability of ascertaining a subject with the phenotype of interest or a difference in the distribution of a quantitative trait between ancestral groups, then any genetic variant with a difference in allele frequency across ancestral groups might seem to be associated with the trait if tests of association are carried out without accounting for ancestry [Freedman, et al. 2004]. Failure to adjust for PS properly can lead to excess false positive results or cause loss of power [Cardon and Palmer 2003; Marchini, et al. 2004]. As a result of the associations of alleles with ancestry, the degree of confounding is related to the sample size of the study, such that larger studies are more acutely affected [Marchini, et al. 2004].
The design of a genetic study can involve one of several sampling strategies and stages. The sampling approach is most likely determined by properties of the trait and the availability of existing studies with biological specimens from the study subjects. For example, when studying a trait with an onset that is typically early in life it may not be feasible to recruit large numbers of healthy unrelated control children, since the parents of healthy children are often less motivated to participate in research than parents of cases. As a result, a family-based design utilizing the TDT or a related statistic may be more efficient. Conversely, for a trait with an onset late in life, other relatives in the family may not be available, and so a case-control study may be easiest to conduct. For a case-control study, an ideal sample of controls would have the same potential as the cases for exposure to risk factors, and if the controls had manifested the trait they would be selected as cases for the study. This principle is violated when there is PS in the data that is also related to the trait of interest through a difference in prevalence for the trait and alleles at many loci in the parental populations.
When designing a genetic study, some effort should be expended to identify the ancestry of subjects before genotyping commences. For example, an option is to require all members of the study to report the ethnicity of all four grandparents for eligibility and crude quantification of ancestry [Velez, et al. 2008]. However, in certain situations this may misclassify an individual’s actual ancestral background if ancestry is a cultural rather than a genetic classifier, as is sometimes the case in Hispanic populations.
One aspect of investigating traits in admixed samples is the difficulty of performing replication studies. Once an association with a particular marker has been observed, a second round of genotyping is usually performed in an independent sample of subjects to verify the signal. To account for PS using global estimates of ancestry, several dozen AIMs may be necessary. This could increase the cost of the replication study by several times, limiting the sample size that may be investigated and consequently the chances of successful replication. When there are a small number of SNPs of interest to replicate, we advocate using local ancestry estimates from the candidate marker and several nearby flanking AIMs to adjust for PS. This issue arises in consortium studies of GWAS data in admixed populations, and can be a challenge to coordinate replication efforts using global estimates of ancestry in previously ungenotyped subjects. Alternatively if participants from non-admixed parental populations are available, they may be analyzed without adjustment for PS [Franceschini, et al. 2013; Monda, et al. 2013].
Another consideration is whether the plan for genotyping accommodates the idiosyncrasies of the study sample. If the study design is investigating candidate genes, then a panel of approximately 30 AIMs may be necessary to quantify global ancestry in African Americans, and more in populations with more complex demographic histories. Alternatively, nearby AIMs not in LD with the gene regions may be added to the genotyping panel to call local ancestry with a method such as LAMP or HAPMIX. Both of these designs require that suitable reference panels of genotypes are available from the appropriate ancestral populations. If this is not the case, then an agnostic method such as STRUCTURE might be used, with a scan through possible values for the number of ancestral subpopulations. If GWAS data are being generated, then MDS or PCA can be applied to summarize continuous axes of ancestral variation and adjust for confounding by PS.
Proper use of these methods requires a working understanding of population genetics principles and association statistics for genetic epidemiology. One of the most important considerations, however, is the study design employed and how that design will work in concert with the analytic methods to produce reliable results.
COMMENTARY
In this article we focused on PS methods and their applications in human disease mapping. In addition, many of the methods we present here are also used in experimental populations of animals and plants, agriculture, and ecology [Bomblies, et al. 2010; Galvan, et al. 2011].
PS is an extensively studied area of research. In addition to the general PS methods described above, some methods are designed for special situations. For example, some methods can conduct association tests for a combination of pedigree and unrelated samples while correcting for PS [Chung, et al. 2010; Thornton and McPeek 2010; Zhu, et al. 2008]. There are also several early methods that used coarse sets of genetic markers and specifically target admixed populations [Hoggart, et al. 2003; Montana and Pritchard 2004; Patterson, et al. 2004]. Readers interested in admixture mapping and population stratification in general can consult other recent reviews on disease mapping in admixed populations [Astle and Balding 2009; Price, et al. 2010; Seldin, et al. 2011].
Genetic research is expanding into more diverse populations, and PS will continue to be important in human genetic studies. It is also becoming clear that the rare alleles carried by each population are unique, and traits may have distinct etiologies in various human populations [Gravel, et al. 2011]. This may be the cause for the apparent failures of some previously observed associations to replicate when the associated SNPs are assayed in other populations. Other causes that are related to the differences between populations are also likely to cause apparent failure to replicate at specific SNPs, such as differences in LD, environmental exposures, and different frequencies of genetic modifiers.
Table 3.
Global ancestry methods and software
| Program | Method | Function | Link for download |
|---|---|---|---|
| Eigensoft | PCA | Calculate PCA from genotype data | https://reich.hms.harvard.edu/software |
| LASER | PCA | Calculate PCA from sequencing data (low pass) | http://laser.sph.umich.edu/ |
| FlashPCA | PCA | Rapid calculation of PCA | https://github.com/gabraham/flashpca |
| PC-AiR | PCA | PCA in samples that may contain cryptically related participants | http://bioconductor.org/packages/release/bioc/html/GENESIS.html |
| PCAmask | PCA | PCA in highly structured populations | https://github.com/armartin/ancestry_pipeline |
| PLINK | MDS | Calculation of multidimensional scaling variables from IBD distance matrix | http://zzz.bwh.harvard.edu/plink/ |
| EMMA | Mixed model | Perform linear mixed model analysis for quantitative traits | http://mouse.cs.ucla.edu/emma/ |
| GEMMA | Mixed Model | Perform linear mixed model analysis for quantitative traits | http://www.xzlab.org/software.html |
| EMMAX | Mixed Model | Perform linear mixed model analysis for quantitative traits more quickly than EMMA | http://genetics.cs.ucla.edu/emmax/ |
| LD score regression | LD score regression | Calculate genomic inflation parameters accounting for LD | https://github.com/bulik/ldsc |
| PC loading regression | PC loading regression | Improved PS control compared with PCA | Not yet available |
| GAP, SCGAP | Geographic Ancestry Positioning | Probabilistic spatial genetic model and ancestry localization algorithm (GAP), with a related population stratification correction procedure for genome-wide association studies (SCGAP) | https://github.com/anand-bhaskar/gap |
| SNPweights | SNPweights | Infer genome-wide genetic ancestry using SNP weights precomputed from large external reference panels | https://www.hsph.harvard.edu/alkes-price/software/ |
| NGSadmix | NGSadmix | Infer admixture proportions from NGS data | http://www.popgen.dk/software/index.php/NgsAdmix |
References
- Abraham G, Inouye M. Fast principal component analysis of large-scale genome-wide data. PLoS One. 2014;9(4):e93766. doi: 10.1371/journal.pone.0093766.
- Alexander DH, Lange K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics. 2011;12:246. doi: 10.1186/1471-2105-12-246.
- Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19(9):1655–64. doi: 10.1101/gr.094052.109.
- Astle W, Balding DJ. Population Structure and Cryptic Relatedness in Genetic Association Studies. Stat Sci. 2009;24(4):451–471.
- Bacanu SA, Devlin B, Roeder K. Association studies for quantitative traits in structured populations. Genet Epidemiol. 2002;22(1):78–93. doi: 10.1002/gepi.1045.
- Baran Y, Pasaniuc B, Sankararaman S, Torgerson DG, Gignoux C, Eng C, Rodriguez-Cintron W, Chapela R, Ford JG, Avila PC, et al. Fast and accurate inference of local ancestry in Latino populations. Bioinformatics. 2012;28:1359–1367. doi: 10.1093/bioinformatics/bts144.
- Bayless TM, Brown E, Paige DM. Lactase Non-persistence and Lactose Intolerance. Curr Gastroenterol Rep. 2017;19(5):23. doi: 10.1007/s11894-017-0558-9.
- Bell GI, Horita S, Karam JH. A polymorphic locus near the human insulin gene is associated with insulin-dependent diabetes mellitus. Diabetes. 1984;33(2):176–83. doi: 10.2337/diab.33.2.176.
- Benn-Torres J, Bonilla C, Robbins CM, Waterman L, Moses TY, Hernandez W, Santos ER, Bennett F, Aiken W, Tullock T, et al. Admixture and population stratification in African Caribbean populations. Ann Hum Genet. 2008;72(Pt 1):90–8. doi: 10.1111/j.1469-1809.2007.00398.x.
- Bersaglieri T, Sabeti PC, Patterson N, Vanderploeg T, Schaffner SF, Drake JA, Rhodes M, Reich DE, Hirschhorn JN. Genetic signatures of strong recent positive selection at the lactase gene. Am J Hum Genet. 2004;74(6):1111–20. doi: 10.1086/421051.
- Bhaskar A, Javanmard A, Courtade TA, Tse D. Novel probabilistic models of spatial genetic ancestry with applications to stratification correction in genome-wide association studies. Bioinformatics. 2017;33(6):879–885. doi: 10.1093/bioinformatics/btw720.
- Bhatia G, Furlotte NA, Loh P-R, Liu X, Finucane HK, Gusev A, Price A. Correcting subtle stratification in summary association statistics. bioRxiv. 2016:076133.
- Bhatia G, Gusev A, Loh P-R, Finucane HK, Vilhjalmsson BJ, Ripke S, Purcell S, Stahl E, Daly M, de Candia TR, et al. Subtle stratification confounds estimates of heritability from rare variants. bioRxiv. 2016.
- Bomblies K, Yant L, Laitinen RA, Kim ST, Hollister JD, Warthmann N, Fitz J, Weigel D. Local-scale patterns of genetic variability, outcrossing, and spatial structure in natural stands of Arabidopsis thaliana. PLoS Genet. 2010;6(3):e1000890. doi: 10.1371/journal.pgen.1000890.
- Browning SR, Thompson EA. Detecting rare variant associations by identity-by-descent mapping in case-control studies. Genetics. 2012;190(4):1521–31. doi: 10.1534/genetics.111.136937.
- Brucato N, Tortevoye P, Plancoulaine S, Guitard E, Sanchez-Mazas A, Larrouy G, Gessain A, Dugoujon JM. The genetic diversity of three peculiar populations descending from the slave trade: Gm study of Noir Marron from French Guiana. C R Biol. 2009;332(10):917–26. doi: 10.1016/j.crvi.2009.07.005.
- Bryc K, Auton A, Nelson MR, Oksenberg JR, Hauser SL, Williams S, Froment A, Bodo JM, Wambebe C, Tishkoff SA, et al. Genome-wide patterns of population structure and admixture in West Africans and African Americans. Proc Natl Acad Sci U S A. 2010;107(2):786–91. doi: 10.1073/pnas.0909559107.
- Bryc K, Durand EY, Macpherson JM, Reich D, Mountain JL. The genetic ancestry of African Americans, Latinos, and European Americans across the United States. Am J Hum Genet. 2015;96(1):37–53. doi: 10.1016/j.ajhg.2014.11.010.
- Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, Yang J, Patterson N, Daly MJ, Price AL, Neale BM. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47(3):291–5. doi: 10.1038/ng.3211.
- Campbell CD, Ogburn EL, Lunetta KL, Lyon HN, Freedman ML, Groop LC, Altshuler D, Ardlie KG, Hirschhorn JN. Demonstrating stratification in a European American population. Nat Genet. 2005;37(8):868–72. doi: 10.1038/ng1607.
- Cardon LR, Palmer LJ. Population stratification and spurious allelic association. Lancet. 2003;361(9357):598–604. doi: 10.1016/S0140-6736(03)12520-2.
- Carmi S, Hui KY, Kochav E, Liu X, Xue J, Grady F, Guha S, Upadhyay K, Ben-Avraham D, Mukherjee S, et al. Sequencing an Ashkenazi reference panel supports population-targeted personal genomics and illuminates Jewish and European origins. Nat Commun. 2014;5:4835. doi: 10.1038/ncomms5835.
- Chen CY, Pollack S, Hunter DJ, Hirschhorn JN, Kraft P, Price AL. Improved ancestry inference using weights from external reference panels. Bioinformatics. 2013;29(11):1399–406. doi: 10.1093/bioinformatics/btt144.
- Cheng YJ, Mailund T, Nielsen R. Fast admixture analysis and population tree estimation for SNP and NGS data. Bioinformatics. 2017. doi: 10.1093/bioinformatics/btx098.
- Choudhry S, Coyle NE, Tang H, Salari K, Lind D, Clark SL, Tsai HJ, Naqvi M, Phong A, Ung N, et al. Population stratification confounds genetic association studies among Latinos. Hum Genet. 2006;118(5):652–64. doi: 10.1007/s00439-005-0071-3.
- Chung RH, Schmidt MA, Morris RW, Martin ER. CAPL: a novel association test using case-control and family data and accounting for population stratification. Genet Epidemiol. 2010;34(7):747–55. doi: 10.1002/gepi.20539.
- Conomos MP, Miller MB, Thornton TA. Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness. Genet Epidemiol. 2015;39(4):276–93. doi: 10.1002/gepi.21896.
- Dadd T, Weale ME, Lewis CM. A critical evaluation of genomic control methods for genetic association studies. Genet Epidemiol. 2009;33(4):290–8. doi: 10.1002/gepi.20379.
- Dandine-Roulland C, Bellenguez C, Debette S, Amouyel P, Genin E, Perdry H. Accuracy of heritability estimations in presence of hidden population stratification. Sci Rep. 2016;6:26471. doi: 10.1038/srep26471.
- Deelen P, Menelaou A, van Leeuwen EM, Kanterakis A, van Dijk F, Medina-Gomez C, Francioli LC, Hottenga JJ, Karssen LC, Estrada K, et al. Improved imputation quality of low-frequency and rare variants in European samples using the ‘Genome of The Netherlands’. Eur J Hum Genet. 2014;22(11):1321–6. doi: 10.1038/ejhg.2014.19.
- Devlin B, Bacanu SA, Roeder K. Genomic Control to the extreme. Nat Genet. 2004;36(11):1129–30; author reply 1131. doi: 10.1038/ng1104-1129.
- Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55(4):997–1004. doi: 10.1111/j.0006-341x.1999.00997.x.
- Devlin B, Roeder K, Wasserman L. Genomic control, a new approach to genetic-based association studies. Theor Popul Biol. 2001;60(3):155–66. doi: 10.1006/tpbi.2001.1542.
- Enattah NS, Sahi T, Savilahti E, Terwilliger JD, Peltonen L, Jarvela I. Identification of a variant associated with adult-type hypolactasia. Nat Genet. 2002;30(2):233–7. doi: 10.1038/ng826.
- Epstein MP, Allen AS, Satten GA. A simple and improved correction for population stratification in case-control studies. Am J Hum Genet. 2007;80(5):921–30. doi: 10.1086/516842.
- Eriksson N, Macpherson JM, Tung JY, Hon LS, Naughton B, Saxonov S, Avey L, Wojcicki A, Pe’er I, Mountain J. Web-based, participant-driven studies yield novel genetic associations for common traits. PLoS Genet. 2010;6(6):e1000993. doi: 10.1371/journal.pgen.1000993.
- Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003;164(4):1567–87. doi: 10.1093/genetics/164.4.1567.
- Franceschini N, Fox E, Zhang Z, Edwards TL, Nalls MA, Sung YJ, Tayo BO, Sun YV, Gottesman O, Adeyemo A, et al. Genome-wide association analysis of blood-pressure traits in African-ancestry individuals reveals common associated genes in African and non-African populations. Am J Hum Genet. 2013;93(3):545–54. doi: 10.1016/j.ajhg.2013.07.010.
- Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N, Gabriel SB, Topol EJ, Smoller JW, Pato CN, et al. Assessing the impact of population stratification on genetic association studies. Nat Genet. 2004;36(4):388–93. doi: 10.1038/ng1333.
- Friedrich DC, Santos SE, Ribeiro-dos-Santos AK, Hutz MH. Several different lactase persistence associated alleles and high diversity of the lactase gene in the admixed Brazilian population. PLoS One. 2012;7(9):e46520. doi: 10.1371/journal.pone.0046520.
- Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am J Hum Genet. 2016;98(3):456–72. doi: 10.1016/j.ajhg.2015.12.022.
- Galvan A, Vorraro F, Cabrera W, Ribeiro OG, Starobinas N, Jensen JR, dos Santos Carneiro P, De Franco M, Gao X, Ibanez OC, et al. Association study by genetic clustering detects multiple inflammatory response loci in non-inbred mice. Genes Immun. 2011;12(5):390–4. doi: 10.1038/gene.2011.10.
- Gao X, Martin ER. Using allele sharing distance for detecting human population stratification. Hum Hered. 2009;68(3):182–91. doi: 10.1159/000224638.
- Gao X, Starmer J. Human population structure detection via multilocus genotype clustering. BMC Genet. 2007;8:34. doi: 10.1186/1471-2156-8-34.
- 1000 Genomes Project Consortium; Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65. doi: 10.1038/nature11632.
- Goddard KA, Hopkins PJ, Hall JM, Witte JS. Linkage disequilibrium and allele-frequency distributions for 114 single-nucleotide polymorphisms in five populations. Am J Hum Genet. 2000;66(1):216–34. doi: 10.1086/302727.
- Gomes KF, Santos AS, Semzezem C, Correia MR, Brito LA, Ruiz MO, Fukui RT, Matioli SR, Passos-Bueno MR, Silva ME. The influence of population stratification on genetic markers associated with type 1 diabetes. Sci Rep. 2017;7:43513. doi: 10.1038/srep43513.
- Gravel S, Henn BM, Gutenkunst RN, Indap AR, Marth GT, Clark AG, Yu F, Gibbs RA, 1000 Genomes Project, Bustamante CD. Demographic history and rare allele sharing among human populations. Proc Natl Acad Sci U S A. 2011;108(29):11983–8. doi: 10.1073/pnas.1019276108.
- Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, Kircher M, Patterson N, Li H, Zhai W, Fritz MH, et al. A draft sequence of the Neandertal genome. Science. 2010;328(5979):710–22. doi: 10.1126/science.1188021.
- Harris K, Nielsen R. Inferring demographic history from a spectrum of shared haplotype lengths. PLoS Genet. 2013;9(6):e1003521. doi: 10.1371/journal.pgen.1003521.
- Higasa K, Miyake N, Yoshimura J, Okamura K, Niihori T, Saitsu H, Doi K, Shimizu M, Nakabayashi K, Aoki Y, et al. Human genetic variation database, a reference database of genetic variations in the Japanese population. J Hum Genet. 2016;61(6):547–53. doi: 10.1038/jhg.2016.12.
- Hoggart CJ, Parra EJ, Shriver MD, Bonilla C, Kittles RA, Clayton DG, McKeigue PM. Control of confounding of genetic associations in stratified populations. Am J Hum Genet. 2003;72(6):1492–1504. doi: 10.1086/375613.
- Hoggart CJ, Shriver MD, Kittles RA, Clayton DG, McKeigue PM. Design and analysis of admixture mapping studies. Am J Hum Genet. 2004;74(5):965–78. doi: 10.1086/420855.
- Holsinger KE, Weir BS. Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet. 2009;10(9):639–50. doi: 10.1038/nrg2611.
- Homa DM, Mannino DM, Lara M. Asthma mortality in U.S. Hispanics of Mexican, Puerto Rican, and Cuban heritage, 1990–1995. Am J Respir Crit Care Med. 2000;161(2 Pt 1):504–9. doi: 10.1164/ajrccm.161.2.9906025.
- Huang J, Howie B, McCarthy S, Memari Y, Walter K, Min JL, Danecek P, Malerba G, Trabetti E, Zheng HF, et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nat Commun. 2015;6:8111. doi: 10.1038/ncomms9111.
- Ingram CJ, Elamin MF, Mulcare CA, Weale ME, Tarekegn A, Raga TO, Bekele E, Elamin FM, Thomas MG, Bradman N, et al. A novel polymorphism associated with lactose tolerance in Africa: multiple causes for lactase persistence? Hum Genet. 2007;120(6):779–88. doi: 10.1007/s00439-006-0291-1.
- Ingram CJ, Raga TO, Tarekegn A, Browning SL, Elamin MF, Bekele E, Thomas MG, Weale ME, Bradman N, Swallow DM. Multiple rare variants as a cause of a common phenotype: several different lactase persistence associated alleles in a single ethnic group. J Mol Evol. 2009;69(6):579–88. doi: 10.1007/s00239-009-9301-y.
- Jenkins DL, Davis LG, Stafford TW, Jr, Campos PF, Hockett B, Jones GT, Cummings LS, Yost C, Connolly TJ, Yohe RM, 2nd, et al. Clovis age Western Stemmed projectile points and human coprolites at the Paisley Caves. Science. 2012;337(6091):223–8. doi: 10.1126/science.1218443.
- Johnson NA, Coram MA, Shriver MD, Romieu I, Barsh GS, London SJ, Tang H. Ancestral components of admixed genomes in a Mexican cohort. PLoS Genet. 2011;7(12):e1002410. doi: 10.1371/journal.pgen.1002410.
- Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010;42(4):348–54. doi: 10.1038/ng.548.
- Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, Eskin E. Efficient control of population structure in model organism association mapping. Genetics. 2008;178(3):1709–23. doi: 10.1534/genetics.107.080101.
- Kawai Y, Mimori T, Kojima K, Nariai N, Danjoh I, Saito R, Yasuda J, Yamamoto M, Nagasaki M. Japonica array: improved genotype imputation by designing a population-specific SNP array with 1070 Japanese individuals. J Hum Genet. 2015;60(10):581–7. doi: 10.1038/jhg.2015.68.
- Kim K, Bang SY, Lee HS, Bae SC. Construction and application of a Korean reference panel for imputing classical alleles and amino acids of human leukocyte antigen genes. PLoS One. 2014;9(11):e112546. doi: 10.1371/journal.pone.0112546.
- Knowler WC, Williams RC, Pettitt DJ, Steinberg AG. Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am J Hum Genet. 1988;43(4):520–6.
- Kodaman N, Aldrich MC, Smith JR, Signorello LB, Bradley K, Breyer J, Cohen SS, Long J, Cai Q, Giles J, et al. A small number of candidate gene SNPs reveal continental ancestry in African Americans. Ann Hum Genet. 2013;77(1):56–66. doi: 10.1111/j.1469-1809.2012.00738.x.
- Kohler K, Bickeboller H. Case-control association tests correcting for population stratification. Ann Hum Genet. 2006;70(Pt 1):98–115. doi: 10.1111/j.1529-8817.2005.00214.x.
- Lappalainen T, Salmela E, Andersen PM, Dahlman-Wright K, Sistonen P, Savontaus ML, Schreiber S, Lahermo P, Kere J. Genomic landscape of positive natural selection in Northern European populations. Eur J Hum Genet. 2010;18(4):471–8. doi: 10.1038/ejhg.2009.184.
- Lee SH, Wray NR, Goddard ME, Visscher PM. Estimating missing heritability for disease from genome-wide association studies. Am J Hum Genet. 2011;88(3):294–305. doi: 10.1016/j.ajhg.2011.02.002.
- Lewinsky RH, Jensen TG, Moller J, Stensballe A, Olsen J, Troelsen JT. T-13910 DNA variant associated with lactase persistence interacts with Oct-1 and stimulates lactase promoter activity in vitro. Hum Mol Genet. 2005;14(24):3945–53. doi: 10.1093/hmg/ddi418.
- Li H, Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475(7357):493–6. doi: 10.1038/nature10231.
- Li N, Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics. 2003;165:2213–2233. doi: 10.1093/genetics/165.4.2213.
- Listgarten J, Lippert C, Kadie CM, Davidson RI, Eskin E, Heckerman D. Improved linear mixed models for genome-wide association studies. Nat Methods. 2012;9(6):525–6. doi: 10.1038/nmeth.2037.
- Low-Kam C, Rhainds D, Lo KS, Provost S, Mongrain I, Dubois A, Perreault S, Robinson JF, Hegele RA, Dube MP, et al. Whole-genome sequencing in French Canadians from Quebec. Hum Genet. 2016;135(11):1213–1221. doi: 10.1007/s00439-016-1702-6.
- Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, Tandon A, et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016;538(7624):201–206. doi: 10.1038/nature18964.
- Maples BK, Gravel S, Kenny EE, Bustamante CD. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am J Hum Genet. 2013;93:278–288. doi: 10.1016/j.ajhg.2013.06.020.
- Marcheco-Teruel B, Parra EJ, Fuentes-Smith E, Salas A, Buttenschon HN, Demontis D, Torres-Espanol M, Marin-Padron LC, Gomez-Cabezas EJ, Alvarez-Iglesias V, et al. Cuba: exploring the history of admixture and the genetic basis of pigmentation using autosomal and uniparental markers. PLoS Genet. 2014;10(7):e1004488. doi: 10.1371/journal.pgen.1004488.
- Marchini J, Cardon LR, Phillips MS, Donnelly P. The effects of human population structure on large genetic association studies. Nat Genet. 2004;36(5):512–7. doi: 10.1038/ng1337.
- Martin LS, Eskin E. Review: Population Structure in Genetic Studies: Confounding Factors and Mixed Models. bioRxiv. 2017. doi: 10.1101/092106. Also indexed under doi: 10.1371/journal.pgen.1007309.
- McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, Teumer A, Kang HM, Fuchsberger C, Danecek P, Sharp K, et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet. 2016;48(10):1279–83. doi: 10.1038/ng.3643.
- McDougall I, Brown FH, Fleagle JG. Stratigraphic placement and age of modern humans from Kibish, Ethiopia. Nature. 2005;433(7027):733–6. doi: 10.1038/nature03258.
- McKeigue PM. Mapping genes that underlie ethnic differences in disease risk: methods for detecting linkage in admixed populations, by conditioning on parental admixture. Am J Hum Genet. 1998;63:241–251. doi: 10.1086/301908.
- Mendizabal I, Sandoval K, Berniell-Lee G, Calafell F, Salas A, Martinez-Fuentes A, Comas D. Genetic origin, admixture, and asymmetry in maternal and paternal human lineages in Cuba. BMC Evol Biol. 2008;8:213. doi: 10.1186/1471-2148-8-213.
- Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, Mallick S, Schraiber JG, Jay F, Prufer K, de Filippo C, et al. A high-coverage genome sequence from an archaic Denisovan individual. Science. 2012;338(6104):222–6. doi: 10.1126/science.1224344.
- Miclaus K, Wolfinger R, Czika W. SNP selection and multidimensional scaling to quantify population structure. Genet Epidemiol. 2009;33(6):488–96. doi: 10.1002/gepi.20401.
- Monda KL, Chen GK, Taylor KC, Palmer C, Edwards TL, Lange LA, Ng MC, Adeyemo AA, Allison MA, Bielak LF, et al. A meta-analysis identifies new loci associated with body mass index in individuals of African ancestry. Nat Genet. 2013;45(6):690–6. doi: 10.1038/ng.2608.
- Montana G, Hoggart C. Statistical software for gene mapping by admixture linkage disequilibrium. Brief Bioinform. 2007;8(6):393–5. doi: 10.1093/bib/bbm035.
- Montana G, Pritchard JK. Statistical tests for admixture mapping with case-control and cases-only data. Am J Hum Genet. 2004;75(5):771–89. doi: 10.1086/425281.
- Moreno-Estrada A, Gravel S, Zakharia F, McCauley JL, Byrnes JK, Gignoux CR, Ortiz-Tello PA, Martinez RJ, Hedges DJ, Morris RW, et al. Reconstructing the population genetic history of the Caribbean. PLoS Genet. 2013;9(11):e1003925. doi: 10.1371/journal.pgen.1003925.
- Mulcare CA, Weale ME, Jones AL, Connell B, Zeitlyn D, Tarekegn A, Swallow DM, Bradman N, Thomas MG. The T allele of a single-nucleotide polymorphism 13.9 kb upstream of the lactase gene (LCT) (C-13.9kbT) does not predict or cause the lactase-persistence phenotype in Africans. Am J Hum Genet. 2004;74(6):1102–10. doi: 10.1086/421050.
- Nielsen R, Akey JM, Jakobsson M, Pritchard JK, Tishkoff S, Willerslev E. Tracing the peopling of the world through genomics. Nature. 2017;541(7637):302–310. doi: 10.1038/nature21347.
- Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, et al. Genes mirror geography within Europe. Nature. 2008;456(7218):98–101. doi: 10.1038/nature07331.
- Olds LC, Sibley E. Lactase persistence DNA variant enhances lactase promoter activity in vitro: functional role as a cis regulatory element. Hum Mol Genet. 2003;12(18):2333–40. doi: 10.1093/hmg/ddg244.
- Ott J, Wang J, Leal SM. Genetic linkage analysis in the age of whole-genome sequencing. Nat Rev Genet. 2015;16(5):275–84. doi: 10.1038/nrg3908.
- Pasaniuc B, Sankararaman S, Kimmel G, Halperin E. Inference of locus-specific ancestry in closely related populations. Bioinformatics. 2009;25:i213–221. doi: 10.1093/bioinformatics/btp197.
- Pasaniuc B, Zaitlen N, Lettre G, Chen GK, Tandon A, Kao WHL, Ruczinski I, Fornage M, Siscovick DS, Zhu X, et al. Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a Breast Cancer Consortium. PLoS Genet. 2011;7:e1001371. doi: 10.1371/journal.pgen.1001371.
- Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, Oksenberg JR, Hauser SL, Smith MW, O’Brien SJ, Altshuler D, et al. Methods for high-density admixture mapping of disease genes. Am J Hum Genet. 2004;74(5):979–1000. doi: 10.1086/420871.
- Pena SD, Di Pietro G, Fuchshuber-Moraes M, Genro JP, Hutz MH, de Kehdy FS, Kohlrausch F, Magno LA, Montenegro RC, Moraes MO, et al. The genomic ancestry of individuals from different geographical regions of Brazil is more uniform than expected. PLoS One. 2011;6(2):e17063. doi: 10.1371/journal.pone.0017063.
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38(8):904–9. doi: 10.1038/ng1847.
- Price AL, Tandon A, Patterson N, Barnes KC, Rafaels N, Ruczinski I, Beaty TH, Mathias R, Reich D, Myers S. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 2009;5:e1000519. doi: 10.1371/journal.pgen.1000519.
- Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet. 2010;11(7):459–63. doi: 10.1038/nrg2813.
- Pritchard JK, Donnelly P. Case-control studies of association in structured or admixed populations. Theor Popul Biol. 2001;60(3):227–37. doi: 10.1006/tpbi.2001.1543.
- Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000a;155(2):945–59. doi: 10.1093/genetics/155.2.945.
- Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. Association mapping in structured populations. Am J Hum Genet. 2000b;67(1):170–81. doi: 10.1086/302959.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75. doi: 10.1086/519795.
- Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW. Genetic structure of human populations. Science. 2002;298(5602):2381–5. doi: 10.1126/science.1078311.
- Ruiz-Narvaez EA, Rosenberg L, Wise LA, Reich D, Palmer JR. Validation of a small set of ancestral informative markers for control of population admixture in African Americans. Am J Epidemiol. 2011;173(5):587–92. doi: 10.1093/aje/kwq401.
- Sankararaman S, Sridhar S, Kimmel G, Halperin E. Estimating local ancestry in admixed populations. Am J Hum Genet. 2008;82:290–303. doi: 10.1016/j.ajhg.2007.09.022.
- Schanfield MS, Kirk RL. Further studies on the immunoglobulin allotypes (Gm, Am and Km) in India. Acta Anthropogenet. 1981;5(1):1–21.
- Schlebusch CM, Skoglund P, Sjodin P, Gattepaille LM, Hernandez D, Jay F, Li S, De Jongh M, Singleton A, Blum MG, et al. Genomic variation in seven Khoe-San groups reveals adaptation and complex African history. Science. 2012;338(6105):374–9. doi: 10.1126/science.1227721.
- Seldin MF, Pasaniuc B, Price AL. New approaches to disease mapping in admixed populations. Nat Rev Genet. 2011;12:523–528. doi: 10.1038/nrg3002.
- Shriner D. Overview of admixture mapping. Curr Protoc Hum Genet. 2013;Chapter 1:Unit 1.23. doi: 10.1002/0471142905.hg0123s76.
- Shriner D, Adeyemo A, Rotimi CN. Joint ancestry and association testing in admixed individuals. PLoS Comput Biol. 2011;7:e1002325. doi: 10.1371/journal.pcbi.1002325.
- Silva ME, Mory D, Davini E. Genetic and humoral autoimmunity markers of type 1 diabetes: from theory to practice. Arq Bras Endocrinol Metabol. 2008;52(2):166–80. doi: 10.1590/s0004-27302008000200004.
- Skotte L, Korneliussen TS, Albrechtsen A. Estimating individual admixture proportions from next generation sequencing data. Genetics. 2013;195(3):693–702. doi: 10.1534/genetics.113.154138.
- Spielman RS, Baur MP, Clerget-Darpoux F. Genetic analysis of IDDM: summary of GAW5 IDDM results. Genet Epidemiol. 1989;6(1):43–58. doi: 10.1002/gepi.1370060111.
- Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993;52(3):506–16.
- Steele CD, Court DS, Balding DJ. Worldwide F(ST) estimates relative to five continental-scale populations. Ann Hum Genet. 2014;78(6):468–77. doi: 10.1111/ahg.12081.
- Sundquist A, Fratkin E, Do CB, Batzoglou S. Effect of genetic divergence in identifying ancestral origin using HAPAA. Genome Res. 2008;18:676–682. doi: 10.1101/gr.072850.107.
- Taliun D, Chothani SP, Schonherr S, Forer L, Boehnke M, Abecasis GR, Wang C. LASER server: ancestry tracing with genotypes or sequence reads. Bioinformatics. 2017. doi: 10.1093/bioinformatics/btx075.
- Tang D, Anderson D, Francis RW, Syn G, Jamieson SE, Lassmann T, Blackwell JM. Reference genotype and exome data from an Australian Aboriginal population for health-based research. Sci Data. 2016;3:160023. doi: 10.1038/sdata.2016.23.
- Tang H, Coram M, Wang P, Zhu X, Risch N. Reconstructing genetic ancestry blocks in admixed individuals. Am J Hum Genet. 2006;79(1):1–12. doi: 10.1086/504302.
- Teare MD, Santibanez Koref MF. Linkage analysis and the study of Mendelian disease in the era of whole exome and genome sequencing. Brief Funct Genomics. 2014;13(5):378–83. doi: 10.1093/bfgp/elu024.
- Thareja G, John SE, Hebbar P, Behbehani K, Thanaraj TA, Alsmadi O. Sequence and analysis of a whole genome from Kuwaiti population subgroup of Persian ancestry. BMC Genomics. 2015;16:92. doi: 10.1186/s12864-015-1233-x.
- Thomson G, Valdes AM, Noble JA, Kockum I, Grote MN, Najman J, Erlich HA, Cucca F, Pugliese A, Steenkiste A, et al. Relative predispositional effects of HLA class II DRB1-DQB1 haplotypes and genotypes on type 1 diabetes: a meta-analysis. Tissue Antigens. 2007;70(2):110–27. doi: 10.1111/j.1399-0039.2007.00867.x.
- Thornton T, McPeek MS. ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure. Am J Hum Genet. 2010;86(2):172–84. doi: 10.1016/j.ajhg.2010.01.001.
- Tishkoff SA, Reed FA, Friedlaender FR, Ehret C, Ranciaro A, Froment A, Hirbo JB, Awomoyi AA, Bodo JM, Doumbo O, et al. The genetic structure and history of Africans and African Americans. Science. 2009;324(5930):1035–44. doi: 10.1126/science.1172257.
- Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, Powell K, Mortensen HM, Hirbo JB, Osman M, et al. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 2007;39(1):31–40. doi: 10.1038/ng1946. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tung JY, Do CB, Hinds DA, Kiefer AK, Macpherson JM, Chowdry AB, Francke U, Naughton BT, Mountain JL, Wojcicki A, et al. Efficient replication of over 180 genetic associations with self-reported medical data. PLoS One. 2011;6(8):e23473. doi: 10.1371/journal.pone.0023473.
- Velez DR, Fortunato SJ, Thorsen P, Lombardi SJ, Williams SM, Menon R. Preterm birth in Caucasians is associated with coagulation and inflammation pathway gene variants. PLoS One. 2008;3(9):e3283. doi: 10.1371/journal.pone.0003283.
- Vernot B, Akey JM. Complex history of admixture between modern humans and Neandertals. Am J Hum Genet. 2015;96(3):448–53. doi: 10.1016/j.ajhg.2015.01.006.
- Vigilant L, Stoneking M, Harpending H, Hawkes K, Wilson AC. African populations and the evolution of human mitochondrial DNA. Science. 1991;253(5027):1503–7. doi: 10.1126/science.1840702.
- Voight BF, Pritchard JK. Confounding from cryptic relatedness in case-control association studies. PLoS Genet. 2005;1(3):e32. doi: 10.1371/journal.pgen.0010032.
- Wahlund S. Composition of populations from the perspective of the theory of heredity. Hereditas. 1928;11:65–105.
- Wang C, Zhan X, Bragg-Gresham J, Kang HM, Stambolian D, Chew EY, Branham KE, Heckenlively J, Study F, Fulton R, et al. Ancestry estimation and control of population stratification for sequence-based association studies. Nat Genet. 2014;46(4):409–15. doi: 10.1038/ng.2924.
- Wang C, Zhan X, Liang L, Abecasis GR, Lin X. Improved ancestry estimation for both genotyping and sequencing data using projection procrustes analysis and genotype imputation. Am J Hum Genet. 2015;96(6):926–37. doi: 10.1016/j.ajhg.2015.04.018.
- Wang C, Zollner S, Rosenberg NA. A quantitative comparison of the similarity between genes and geography in worldwide human populations. PLoS Genet. 2012a;8(8):e1002886. doi: 10.1371/journal.pgen.1002886.
- Wang K, Hu X, Peng Y. An analytical comparison of the principal component method and the mixed effects model for association studies in the presence of cryptic relatedness and population stratification. Hum Hered. 2013;76(1):1–9. doi: 10.1159/000353345.
- Wang S, Chen W, Chen X, Hu F, Archer KJ, Liu HN, Sun S, Gao G. Double genomic control is not effective to correct for population stratification in meta-analysis for genome-wide association studies. Front Genet. 2012b;3:300. doi: 10.3389/fgene.2012.00300.
- Wang X, Zhu X, Qin H, Cooper RS, Ewens WJ, Li C, Li M. Adjustment for local ancestry in genetic association analysis of admixed populations. Bioinformatics. 2011;27(5):670–7. doi: 10.1093/bioinformatics/btq709.
- Weir BS, Cockerham CC. Estimating F-statistics for the analysis of population structure. Evolution. 1984;38(6):1358–70. doi: 10.1111/j.1558-5646.1984.tb05657.x.
- Weir BS, Hill WG. Estimating F-statistics. Annu Rev Genet. 2002;36:721–50. doi: 10.1146/annurev.genet.36.050802.093940.
- White TD, Asfaw B, DeGusta D, Gilbert H, Richards GD, Suwa G, Howell FC. Pleistocene Homo sapiens from Middle Awash, Ethiopia. Nature. 2003;423(6941):742–7. doi: 10.1038/nature01669.
- Williams RC, Steinberg AG, Gershowitz H, Bennett PH, Knowler WC, Pettitt DJ, Butler W, Baird R, Dowda-Rea L, Burch TA, et al. GM allotypes in Native Americans: evidence for three distinct migrations across the Bering land bridge. Am J Phys Anthropol. 1985;66(1):1–19. doi: 10.1002/ajpa.1330660102.
- Winkler CA, Nelson GW, Smith MW. Admixture mapping comes of age. Annu Rev Genomics Hum Genet. 2010;11:65–89. doi: 10.1146/annurev-genom-082509-141523.
- Wong LP, Lai JK, Saw WY, Ong RT, Cheng AY, Pillai NE, Liu X, Xu W, Chen P, Foo JN, et al. Insights into the genetic structure and diversity of 38 South Asian Indians from deep whole-genome sequencing. PLoS Genet. 2014;10(5):e1004377. doi: 10.1371/journal.pgen.1004377.
- Wong LP, Ong RT, Poh WT, Liu X, Chen P, Li R, Lam KK, Pillai NE, Sim KS, Xu H, et al. Deep whole-genome sequencing of 100 southeast Asian Malays. Am J Hum Genet. 2013;92(1):52–66. doi: 10.1016/j.ajhg.2012.12.005.
- Wright S. Systems of Mating. V. General Considerations. Genetics. 1921;6(2):167–78. doi: 10.1093/genetics/6.2.167.
- Yang BZ, Zhao H, Kranzler HR, Gelernter J. Practical population group assignment with selected informative markers: characteristics and properties of Bayesian clustering via STRUCTURE. Genet Epidemiol. 2005;28(4):302–12. doi: 10.1002/gepi.20070.
- Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565–9. doi: 10.1038/ng.608.
- Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82. doi: 10.1016/j.ajhg.2010.11.011.
- Yang WY, Novembre J, Eskin E, Halperin E. A model-based approach for analysis of spatial structure in genetic data. Nat Genet. 2012;44(6):725–31. doi: 10.1038/ng.2285.
- Zhang F, Wang Y, Deng HW. Comparison of population-based association study methods correcting for population stratification. PLoS One. 2008;3(10):e3392. doi: 10.1371/journal.pone.0003392.
- Zhou X, Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 2012;44(7):821–4. doi: 10.1038/ng.2310.
- Zhu X, Li S, Cooper RS, Elston RC. A unified association analysis approach for family and unrelated samples correcting for stratification. Am J Hum Genet. 2008;82(2):352–65. doi: 10.1016/j.ajhg.2007.10.009.
- Zhu X, Zhang S, Tang H, Cooper R. A classical likelihood based approach for admixture mapping using EM algorithm. Hum Genet. 2006;120(3):431–45. doi: 10.1007/s00439-006-0224-z.


