Abstract
Studies of human complex diseases and traits associated with candidate genes are potentially vulnerable to bias (confounding) due to population stratification and inbreeding, especially in admixed populations. In genome-wide association studies (GWAS) the Principal Components (PCs) method provides a global ancestry value per subject, allowing corrections for population stratification. However, these coefficients are typically estimated assuming unrelated individuals, and if family structure is present and ignored, such sub-structure may induce artifactual PCs. Extensions of the PCs method have been proposed by Konishi and Rao (1992), taking into account only sibling relatedness, and by Oualkacha et al. (2012), taking into account large pedigrees and high-dimensional phenotype data. In this work we extended these methods to estimate the global individual ancestry coefficients from PCs derived from different variance components matrix estimators, using single nucleotide polymorphisms (SNPs) from two simulated data sets and two real data sets: the GENOA sibship data, consisting of European and African American subjects, and the Baependi Heart Study, consisting of 80 extended Brazilian families, both genotyped with the Affymetrix 6.0 chip. Our results showed that the family structure plays an important role in the estimation of the global individual ancestry for extended pedigrees but not for sibships.
Keywords: association analysis, multivariate polygenic mixed models, kinship matrix, admixed population
Introduction
Studies of human complex diseases and traits associated with candidate genes are potentially vulnerable to bias (confounding) due to population stratification and inbreeding, especially in admixed populations. To correct for population stratification in genome-wide association studies (GWAS), several approaches have been proposed, such as genomic control (GC) [1] and principal components (PCs) [2, 3]. The advantages of the PCs approach are that it can be applied to discrete and continuous variables, it can correct for population stratification, and it provides a global ancestry value per subject, as implemented in the EIGENSOFT software [2]. Its disadvantages are that it is only applicable to uncorrelated data sets and it does not handle admixed populations in family data.
Several packages are available that take admixture into account, calculating not only the global ancestry but also the local individual ancestry coefficients; however, these have limitations. For instance, the program STRUCTURE can handle more than two ancestral populations, but it cannot handle a large number of single nucleotide polymorphisms (SNPs) and it is applicable only to unrelated subjects [4, 5]. On the other hand, the program HAPMIX can handle genome-wide genotype data, but it is only applicable to two ancestral populations and to unrelated subjects [6]. Recently, a PCs-based algorithm for determining ancestry along each chromosome from a high-density, genome-wide set of phased SNP genotypes of admixed individuals (PCAdmix) was developed for more than 3-population admixture [7]. When the data consist of families or related subjects, one proposed alternative is to calculate the eigenvectors (or their loadings) on the founders and married-ins only (i.e., the unrelated subjects). Then, using these eigenvectors, one can determine the PCs for the related subjects, as implemented in the R package SNPRelate [8]. However, when using these programs, if family structure is present and ignored, such sub-structure may induce artifactual PCs.
Principal components for related subjects were first proposed by Konishi and Rao [9] for families with different numbers of siblings by taking into account between-family variability; they highly recommended the use of a consistent estimator for the kinship matrix to calculate the eigenvectors in family data. Ott and Rabinowitz [10] introduced the principal components of heritability (PCH), which capture the familial information across phenotypes by calculating linear combinations of traits that maximize the heritability of the combined phenotypes in family data. They showed that using the first PCH as a quantitative trait in linkage analysis yields a gain in power compared to standard PCA. Within the PCH framework, Wang et al. [11] proposed ridge-penalized PCs based on heritability (PCHλ) for high-dimensional family data by adding a penalty to the subject-specific variation. Oualkacha et al. [12] proposed an ANOVA estimator of the variance components for general pedigrees and high-dimensional family data using the PCH framework. Using simulated pedigree data, they showed that their proposed method yields results similar to those of PCH and PCHλ.
The goal of our paper is to incorporate SNPs instead of quantitative traits in the PCHλ approach proposed by Oualkacha et al. [12] in order to calculate PCs that correct for population stratification while taking the family structure into account. For admixed populations, Thornton et al. [13, 14] proposed a method that takes the admixture into account in the kinship coefficients, which enter the model as a random effect, using SNP data. Their method captures the relatedness between admixed subjects very well since it uses the ancestral population allele frequency for each SNP. The difference between the two approaches is that theirs includes the admixture in the polygenic random effect, i.e., the elements of the kinship matrix are estimated taking into account the proportions of three ancestral populations (European, African and Asian), whereas ours corrects for population stratification in the fixed effects, using PCs that take the family structure into account.
The paper is organized as follows. In the Methods section we describe the methods to calculate PCs and the data sets used: simulated data and two real data sets, the GENOA sibship data consisting of European and African American subjects [15, 16] and the Baependi Heart Study consisting of 80 extended families collected from the highly admixed Brazilian population [17, 18], both with SNP data from the Affymetrix 6.0 chip. In the Results section we present the results obtained with the simulated and real data sets using the approaches described previously. In the Discussion and Conclusion sections we summarize our findings, discuss their advantages and limitations, and close with final observations. Supplementary materials are also included to provide further clarification of the proposed methods.
Material and Methods
Principal component analysis for extended pedigrees
Let Yf be an (nf p × 1) vector containing all p variables for all nf members of the fth family, with E(Yf) = 1f ⊗ μf and Cov(Yf) = 2Φf ⊗ Σg + If ⊗ Σe, where μf is the (p × 1) mean vector for all p variables, Φf is the kinship matrix of the fth family, and Σg and Σe are the (p × p) covariance matrices of the p variables associated with the polygenic and error components, respectively. For all F families we define Y as an (Np × 1) vector containing all p variables for all N individuals, with E(Y) = 1N ⊗ μf and Cov(Y) = Diag(2Φf) ⊗ Σg + IN ⊗ Σe, f = 1, …, F. This framework represents the multivariate family-based model [10, 12, 19, 20]. In this paper, the p variables represent the SNP genotypes (standardized or not) selected from the whole genome to estimate the global ancestry coefficients for individuals in family structures. More details about the model described here and the approaches described below are available in [21].
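As an illustration of the covariance structure of the multivariate family-based model, the following minimal sketch builds Cov(Yf) = 2Φf ⊗ Σg + If ⊗ Σe for one family with NumPy. The function name and the toy values of Φf, Σg and Σe are hypothetical and chosen only to make the block structure visible; they are not taken from the data sets analyzed here.

```python
import numpy as np

def family_covariance(phi_f, sigma_g, sigma_e):
    """Cov(Y_f) = 2*Phi_f (x) Sigma_g + I_f (x) Sigma_e for one family."""
    n_f = phi_f.shape[0]
    return np.kron(2.0 * phi_f, sigma_g) + np.kron(np.eye(n_f), sigma_e)

# Hypothetical toy inputs: three outbred full sibs (kinship 1/2 with
# oneself, 1/4 between sibs) and p = 2 variables.
phi_sibs = 0.25 * (np.eye(3) + np.ones((3, 3)))
sigma_g = np.array([[1.0, 0.3], [0.3, 0.5]])
sigma_e = np.array([[0.2, 0.0], [0.0, 0.1]])

cov_f = family_covariance(phi_sibs, sigma_g, sigma_e)
print(cov_f.shape)  # (6, 6)
```

In this toy case each diagonal (p × p) block equals Σg + Σe and each off-diagonal block equals Σg/2, which is the dependence between relatives that an analysis assuming unrelated subjects ignores.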
Principal component analysis for family data
Using the model described above, the PCs are calculated using unbiased or consistent estimators of the covariance matrices Σg and Σe. Konishi and Rao proposed ANOVA-based estimators for these covariance matrices using nuclear families [9]. From now on, the superscript K will represent this method. For sibships of any size, and in the absence of inbreeding, the kinship matrix has the same structure in every family (kinship coefficient 1/2 for an individual with itself and 1/4 between full sibs), i.e., 2Φf = (If + 1f1f′)/2, and by using the model described above, the estimators for the covariance matrices can be obtained from the classical ANOVA results, as shown below,
| (1) |
with Sw and Sb as the sums of squares and cross-products matrices within and between families, respectively. Thus, the PCs can be obtained from the spectral decomposition of the estimated polygenic, error, and total covariance matrices.
For the standardized data, we use the corresponding correlation matrices, obtained by rescaling the estimated covariance matrices with DK, the diagonal matrix of estimated standard deviations of the p variables, j = 1, 2, …, p.
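The following sketch shows how ANOVA-type estimators of this kind arise from the model above when every family is a sibship of outbred full sibs, so that 2Φf has ones on the diagonal and 1/2 elsewhere. The constant n_0 and the arrangement of terms are our notation for illustration; they follow from the classical one-way ANOVA expectations and may differ in form from the estimators displayed in (1).

```latex
\begin{align*}
\mathrm{Cov}(Y_{fi}) &= \Sigma_g + \Sigma_e, \qquad
\mathrm{Cov}(Y_{fi}, Y_{fj}) = \tfrac{1}{2}\,\Sigma_g \quad (i \neq j),\\[4pt]
E\!\left[\frac{S_w}{N-F}\right] &= \tfrac{1}{2}\,\Sigma_g + \Sigma_e, \qquad
E\!\left[\frac{S_b}{F-1}\right] = \tfrac{1}{2}\,\Sigma_g + \Sigma_e + \frac{n_0}{2}\,\Sigma_g,\\[4pt]
n_0 &= \frac{1}{F-1}\left(N - \frac{1}{N}\sum_{f=1}^{F} n_f^{2}\right),\\[4pt]
\hat{\Sigma}_g &= \frac{2}{n_0}\left(\frac{S_b}{F-1} - \frac{S_w}{N-F}\right), \qquad
\hat{\Sigma}_e = \frac{S_w}{N-F} - \tfrac{1}{2}\,\hat{\Sigma}_g .
\end{align*}
```

Here the between-sib covariance Σg/2 plays the role of the "between-family" component and Σg/2 + Σe that of the "within-family" component of a one-way random-effects layout.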
When the multivariate family-based model described above is extended to general pedigrees, Oualkacha et al. proposed the use of ANOVA estimators for the variance component matrices [12]. Using the sums of squares and cross-products matrices Sw and Sb defined for equation (1), the estimators of the covariance matrices are written as
| (2) |
| (3) |
From now on, the superscript A will represent the Oualkacha method. When all families are sibships, the estimators in (2) and (3) reduce to the ANOVA estimators in (1) proposed by Konishi and Rao [9]. The PCs are obtained by spectral decomposition of the estimated polygenic, error, and total covariance matrices. Furthermore, we can also use the corresponding correlation matrices, proposed by Bilodeau and Duchesne [22], for (2) and (3) to calculate the PCs; these are obtained by rescaling the estimated covariance matrices with DA, a diagonal matrix of estimated standard deviations of the p variables, j = 1, 2, …, p.
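To make the idea of ANOVA-type estimation for general pedigrees concrete, the sketch below computes moment estimators of Σg and Σe by equating Sw and Sb to their expectations under Cov(Yf) = 2Φf ⊗ Σg + If ⊗ Σe with a common mean. This is our own derivation under those assumptions, not necessarily the exact form of the estimators in (2) and (3); the function name and the coefficients a_w and a_b are hypothetical notation.

```python
import numpy as np

def anova_variance_components(families, kinships):
    """Moment (ANOVA-type) estimates of Sigma_g and Sigma_e.

    families: list of (n_f x p) arrays, one per family.
    kinships: list of (n_f x n_f) kinship matrices Phi_f.
    Requires F > 1 families and N > F individuals.
    """
    F = len(families)
    N = sum(y.shape[0] for y in families)
    grand_mean = np.vstack(families).mean(axis=0)
    p = grand_mean.size

    s_w = np.zeros((p, p))  # within-family sums of squares and cross-products
    s_b = np.zeros((p, p))  # between-family sums of squares and cross-products
    a_w = 0.0               # coefficient of Sigma_g in E[S_w]
    quad_over_n = 0.0       # sum_f 1' (2 Phi_f) 1 / n_f
    quad_total = 0.0        # sum_f 1' (2 Phi_f) 1
    for y, phi in zip(families, kinships):
        n_f = y.shape[0]
        fam_mean = y.mean(axis=0)
        centered = y - fam_mean
        s_w += centered.T @ centered
        diff = fam_mean - grand_mean
        s_b += n_f * np.outer(diff, diff)
        quad = 2.0 * phi.sum()
        a_w += 2.0 * np.trace(phi) - quad / n_f
        quad_over_n += quad / n_f
        quad_total += quad
    a_b = quad_over_n - quad_total / N  # coefficient of Sigma_g in E[S_b]

    # Solve E[S_w] = a_w*Sg + (N-F)*Se and E[S_b] = a_b*Sg + (F-1)*Se.
    sigma_g = (s_b / (F - 1) - s_w / (N - F)) / (a_b / (F - 1) - a_w / (N - F))
    sigma_e = (s_w - a_w * sigma_g) / (N - F)
    return sigma_g, sigma_e

# Toy usage with hypothetical data: four sibships and p = 3 variables.
rng = np.random.default_rng(1)
sizes = [2, 3, 4, 3]
fams = [rng.normal(size=(n, 3)) for n in sizes]
kins = [0.25 * (np.eye(n) + np.ones((n, n))) for n in sizes]
sigma_g_hat, sigma_e_hat = anova_variance_components(fams, kins)
```

With sibship kinship matrices, as in the toy usage, these expressions coincide with the classical one-way ANOVA estimators for sibships, which is consistent with the reduction to the Konishi and Rao estimators noted above.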
We also apply the Principal Components of Heritability (PCH) approach proposed by Ott and Rabinowitz [10] and extended by Oualkacha et al. [12] to determine the PCs for extended pedigrees. Here, instead of looking for the linear combinations of phenotypes with maximum variance, the focus is on the linear combinations of phenotypes that maximize heritability, which accounts for the intra-family correlation. Our goal is to find combinations of SNPs that maximize the trace of the heritability matrix, which is equivalent to obtaining the eigenvectors b of a generalized eigensystem [23]; the PCs are then computed from these eigenvectors. Because the matrices are high-dimensional and sparse (p ≫ N), Wang et al. proposed a ridge-penalized principal components approach to obtain the PCs of heritability while accommodating a large number of phenotypes [11]. In their approach one can use either the PCs from the multivariate familial covariance matrix (for unstandardized data) or from the correlation matrix (for standardized data) as proposed by Bilodeau and Duchesne [22]. The leading PC, PCHλ, is then defined through a penalized version of the heritability criterion, with λ as the regularization parameter to be specified. When λ = 0, PCHλ is the original non-penalized leading PC (PCH). When λ → ∞, the second term of the denominator of PCHλ dominates and PCHλ approaches the linear combination that maximizes the between-family variation (PCB). When λ is between zero and infinity, the solution corresponds to the direction that provides the optimal balance between maximizing the between-family variation (PCB) and minimizing the within-family variation (PCW). These PCBs are the ones we use and recommend in our association analyses.
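The behaviour of the penalized criterion at the two extremes of λ can be sketched numerically. The block below assumes, for illustration only, the ratio b′Σg b / (b′(Σg + Σe) b + λ b′b); this specific form, the function name, and the toy matrices are our assumptions and should not be read as the exact definition of PCHλ used in [11] or [12]. The generalized eigenproblem is solved with a Cholesky factor of the denominator matrix.

```python
import numpy as np

def penalized_pch_direction(sigma_g, sigma_e, lam=0.0):
    """Leading direction b maximizing b'Sg b / (b'(Sg + Se) b + lam * b'b).

    Assumed criterion (see lead-in); the denominator matrix must be
    positive definite.
    """
    p = sigma_g.shape[0]
    denom = sigma_g + sigma_e + lam * np.eye(p)
    chol = np.linalg.cholesky(denom)  # denom = chol @ chol.T
    # Reduce to a standard symmetric problem: M = chol^{-1} Sg chol^{-T}.
    half = np.linalg.solve(chol, sigma_g)
    m = np.linalg.solve(chol, half.T)
    m = 0.5 * (m + m.T)               # enforce numerical symmetry
    eigvals, eigvecs = np.linalg.eigh(m)
    b = np.linalg.solve(chol.T, eigvecs[:, -1])
    return b / np.linalg.norm(b), eigvals[-1]

# Hypothetical toy matrices: variable 1 has the largest polygenic
# variance, variable 2 the largest ratio Sg / (Sg + Se).
sigma_g = np.diag([3.0, 1.0, 0.5])
sigma_e = np.diag([3.0, 0.1, 1.0])

b_pch, ratio_pch = penalized_pch_direction(sigma_g, sigma_e, lam=0.0)
b_large, _ = penalized_pch_direction(sigma_g, sigma_e, lam=1e6)
print(np.round(np.abs(b_pch), 3), np.round(np.abs(b_large), 3))
```

In this toy example the unpenalized direction loads on the second variable (highest heritability), whereas for very large λ the direction moves to the first variable (largest polygenic variance), mirroring the limiting behaviour described above.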
For comparison, we assume all subjects are unrelated. Thus, the covariance matrix using unrelated subjects, ΣU is estimated as
| (4) |
with the overall mean vector computed over all N subjects and the superscript U representing the unrelated subjects. This approach is equivalent to the PCs estimated by Price et al. [2]. The global ancestry coefficients are obtained from the spectral decomposition of the corresponding correlation matrix, with DU as the diagonal matrix of standard deviations, i.e., the square roots of the diagonal elements of the estimated ΣU, j = 1, 2, …, p. The global ancestry coefficients may also be obtained from the singular value decomposition of the standardized data matrix, whose ith column contains the standardized genotypes of the ith individual.
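The equivalence between the two routes for unrelated subjects (spectral decomposition of the correlation matrix and singular value decomposition of the standardized data) can be checked with a short sketch. It standardizes each SNP by its sample standard deviation with divisor N − 1 and stores individuals in rows; both choices, the function names, and the toy genotypes are assumptions made for illustration, and the sketch assumes no monomorphic SNPs.

```python
import numpy as np

def standardize(genotypes):
    """Center and scale each SNP (column); rows are individuals."""
    centered = genotypes - genotypes.mean(axis=0)
    return centered / centered.std(axis=0, ddof=1)

def pcs_from_correlation(genotypes, n_pcs=2):
    """PC scores from the eigen-decomposition of the correlation matrix."""
    z = standardize(genotypes)
    corr = z.T @ z / (z.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_pcs]
    return z @ eigvecs[:, order], eigvals[order]

def pcs_from_svd(genotypes, n_pcs=2):
    """The same scores from the SVD of the standardized data matrix."""
    z = standardize(genotypes)
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs], s[:n_pcs] ** 2 / (z.shape[0] - 1)

# Hypothetical toy genotypes coded 0/1/2: 40 individuals, 6 SNPs.
rng = np.random.default_rng(0)
toy = rng.binomial(2, 0.4, size=(40, 6)).astype(float)

scores_eig, var_eig = pcs_from_correlation(toy)
scores_svd, var_svd = pcs_from_svd(toy)
```

The two sets of scores agree up to the sign of each component, and the eigenvalues of the correlation matrix equal the squared singular values divided by N − 1.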
Data Description
Real datasets
The first dataset is the GENOA sibship data, which consists of European Ancestry (EA: Rochester, MN) and African American (AA: Jackson, MS) subjects with SNP data from the Affymetrix 6.0 chip [15, 16]. Sibships with only one sibling were excluded from the analysis. For EA (AA), the screened data have 534 (548) families with 1,386 (1,263) individuals and data on 83,568 (50,510) SNPs. The second is the Baependi Heart Study, which consisted of 119 extended Brazilian families with 1,712 individuals and SNP data from the Affymetrix 6.0 SNP chip [17]. Families with only one individual, or unrelated individuals with genotype data, were excluded due to lack of information.
Simulated datasets
The simulated data consisted of 100 families with 20 subjects each. Each family is assumed to have the same pedigree structure, with 4 generations and 6 founders, as described in Figure 1. The data were simulated under four scenarios, combining 100 or 3,000 SNPs with Fst = 0.05 or 0.20. The genotype data were simulated using the Balding-Nichols model with allele frequencies for 2 distinct ancestral populations [24]. The ancestral allele frequencies, pap, were generated independently for all SNPs from a Uniform distribution, U(0.1, 0.9). The allele frequencies in the sub-populations, separated by Wright’s Fst, were randomly drawn from a Beta distribution with parameters pap(1 − Fst)/Fst and (1 − pap)(1 − Fst)/Fst, as in the Balding-Nichols model (Table 5). The admixture proportions for the 6 founders were randomly selected from a U(0, 1) distribution. Genotype data at each SNP were generated for the 6 founders based on each individual’s admixture proportions for the 2 sub-populations, and for the 14 non-founders by dropping alleles down the pedigree. We used the R program pca_admixture_unrelateds_and_pedigrees.r to simulate the data.
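The simulation steps above can be sketched as follows. This is an illustrative Python re-expression under stated assumptions, not the R program used for the analyses: function names are hypothetical, each founder allele is drawn independently from the admixture-weighted sub-population frequency, and SNPs are treated as unlinked when alleles are dropped from parents to a child.

```python
import numpy as np

def simulate_founders(n_snps, n_founders, fst, rng):
    """Founder haplotypes under a two-population Balding-Nichols model."""
    # Ancestral allele frequencies, U(0.1, 0.9).
    p_anc = rng.uniform(0.1, 0.9, size=n_snps)
    # Balding-Nichols: Beta(p(1-F)/F, (1-p)(1-F)/F) per sub-population.
    scale = (1.0 - fst) / fst
    p_sub = rng.beta(p_anc * scale, (1.0 - p_anc) * scale, size=(2, n_snps))
    # Admixture proportion from sub-population 1 for each founder, U(0, 1).
    admix = rng.uniform(0.0, 1.0, size=n_founders)
    # Assumption: each allele is drawn from the admixture-weighted frequency.
    p_ind = admix[:, None] * p_sub[0] + (1.0 - admix[:, None]) * p_sub[1]
    haplotypes = rng.binomial(1, p_ind[:, None, :], size=(n_founders, 2, n_snps))
    return haplotypes, admix

def drop_alleles(mother, father, rng):
    """Child haplotypes: one randomly chosen allele per SNP from each parent."""
    n_snps = mother.shape[1]
    idx = np.arange(n_snps)
    from_mother = mother[rng.integers(0, 2, size=n_snps), idx]
    from_father = father[rng.integers(0, 2, size=n_snps), idx]
    return np.stack([from_mother, from_father])

rng = np.random.default_rng(2024)
founders, admix = simulate_founders(n_snps=100, n_founders=6, fst=0.2, rng=rng)
child = drop_alleles(founders[0], founders[1], rng)
child_genotype = child.sum(axis=0)  # 0/1/2 allele counts per SNP
```

Repeating `drop_alleles` along the pedigree of Figure 1 would produce the 14 non-founders of a family; the choice of which founders mate in the usage lines is arbitrary and for illustration only.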
Figure 1.

The pedigree structure used for each family in the simulated dataset. Squares indicate males and circles indicate females.
Table 5.
Intercontinental autosomal genetic distances based on SNPs (Pairwise FST).
| CEU | Yoruba | Asian | |
|---|---|---|---|
| CEU | 0.153 | ||
| Yoruba | 0.111 | 0.190 | |
| Asian | 0.110 | 0.192 | 0.007 |
Results
For the sibship GENOA dataset, the following exclusion criteria were applied to each dataset using PLINK [25]: subjects with more than 5% missing genotype information were screened out; SNPs were excluded if they had minor allele frequency (MAF) ≤ 5%, missingness > 5%, or HWE p-value ≤ 0.05, or if they were located in the HLA region, in the chromosome 8 and 17 inversion regions, or on the X, Y and MT chromosomes; the remaining SNPs were pruned for LD (VIF = 1.11 with window size 50 and window shift 5 SNPs) and thinned after the LD screening (−thin 0.2). The two final datasets include 9,224 common SNPs and 2,383 individuals (Rochester, R: 1,304; Jackson, J: 1,079) from 816 sibships (R: 452; J: 364). The distribution of sibship sizes is given in Table 1.
Table 1.
Distribution of the family size from the GENOA study by European Ancestry (EA) and African American (AA) participants. Nf represents the number of siblings per family.
| Nf | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EA | 264 | 101 | 37 | 24 | 9 | 6 | 4 | 3 | 1 | 1 | 1 | – | 1 |
| AA | 191 | 92 | 38 | 15 | 17 | 4 | 4 | 2 | – | – | – | 1 | – |
For this sibship data, the principal components using Konishi’s approach performed similarly for standardized and unstandardized genotype data (Figure 2 top). Furthermore, the decomposition of the polygenic covariance matrix, Σg, was more powerful and robust than the decomposition of the residual covariance matrix, Σe, and provided results similar to those from the total covariance matrix (data not shown). This was due to the fact that the estimated error covariance matrix, Σe, is close to a null matrix. However, when we assume that the data are unrelated, the principal components using Price’s approach show more variability within subjects and are sensitive to standardization, i.e., the unstandardized principal components had the power to discriminate the two racial groups, but they are not sensitive enough to detect outlier families (Figure 2 bottom). Although the principal components plots are similar between the two approaches, the proportion of variance explained by their principal components differs. For the standardized data, the proportion of variance explained by the first three principal components was 10.1% for the Konishi-Rao method compared to 4.98% for Price’s (Table 3 and supplementary Figure S1).
Figure 2.

Principal components for standardized and unstandardized genotype data (top), and variability and sensitivity of Price’s principal components (bottom), using the GENOA sibship data.
Table 3.
Proportion of variation* explained by the first 10 PCs from Price’s and Konishi-Rao’s methods for standardized values using the GENOA data.
| PCs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Price | 0.044 | 0.0033 | 0.0025 | 0.0024 | 0.0023 | 0.0021 | 0.0020 | 0.0020 | 0.0019 | 0.0019 |
| Konishi-Rao | 0.089 | 0.0069 | 0.0052 | 0.0048 | 0.0046 | 0.0044 | 0.0042 | 0.0040 | 0.0039 | 0.0039 |
*The proportion of variation explained by the i-th PC was calculated as λi / (λ1 + λ2 + … + λn), where λi is the i-th largest eigenvalue and n is the total number of eigenvalues for each method.
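As a small worked example of this calculation, the sketch below computes the proportion of variation from a list of eigenvalues and then re-adds the first three entries of the Price row of Table 3. The eigenvalues in the first call are hypothetical toy values; only the three proportions in the second call come from Table 3.

```python
def proportion_of_variation(eigenvalues):
    """Return each eigenvalue divided by the sum of all eigenvalues."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]

# Hypothetical toy eigenvalues, already sorted from largest to smallest.
toy_props = proportion_of_variation([4.0, 3.0, 2.0, 1.0])
print(toy_props)  # [0.4, 0.3, 0.2, 0.1]

# Cumulative proportion for the first three PCs (Price row of Table 3).
price_first_three = round(0.044 + 0.0033 + 0.0025, 4)
print(price_first_three)  # 0.0498
```

The second value corresponds to the 4.98% reported for Price’s method in the text.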
For the Baependi dataset, the same exclusion criteria used for the GENOA data were applied. The final dataset includes 80 families (1,109 individuals) and 8,764 SNPs. The distribution of family sizes is given in Table 2.
Table 2.
Distribution of the family size for the Baependi Heart Study. Nf represents the number of subjects per family.
| Nf | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 2 | 8 | 6 | 8 | 5 | 1 | 4 | 6 | 5 | 6 | 1 | 5 | 3 |
| Nf | 16 | 18 | 19 | 21 | 24 | 27 | 32 | 46 | 48 | 60 | 61 | 68 | 93 |
| Frequency | 4 | 3 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
For this family data, both Oualkacha and Price standardized principal components show a similar population structure pattern with two long arms connected (Figure 3 top). Using the principal components 2 and 3, Oualkacha results show much more population structure than Price results, indicating that the first three PCs from Oualkacha are more informative than those from Price (Figure 3 bottom). Furthermore, the Oualkacha results allow more dispersion among individuals than Price results, mainly for individuals allocated between the long arms, indicating the relevance of the family relatedness (Figure 4) and that the population stratification information is retrieved from Σg (the kinship matrix) but not from Σe. This can be further observed in Table 4 and supplementary Figure S2, which show that the proportion of variation explained by the first 3 PCs of Oualkacha was 19.1% compared to 4.4% from Price’s.
Figure 3.

Comparison of Price’s and Oualkacha’s principal components using the Baependi family data. The plots of PC1 and PC2 are very similar between the two methods despite the scale difference (top); however, the PC2 and PC3 plots differ drastically between the two methods (bottom).
Figure 4.

Distribution of some families, selected by size, from the Baependi data in the Price and Oualkacha principal components plots. One can observe the PCs’ arms differ between the two methods.
Table 4.
Proportion of variation* explained by the first 10 PCs from Price’s and Oualkacha’s methods for standardized values using the Baependi data.
| PCs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Price | 0.022 | 0.014 | 0.008 | 0.0069 | 0.0068 | 0.0061 | 0.0059 | 0.0055 | 0.0053 | 0.005 |
| Oualkacha | 0.086 | 0.070 | 0.035 | 0.031 | 0.028 | 0.026 | 0.0255 | 0.0249 | 0.0246 | 0.0239 |
*The proportion of variation explained by the i-th PC was calculated as λi / (λ1 + λ2 + … + λn), where λi is the i-th largest eigenvalue and n is the total number of eigenvalues for each method.
Since we observed this difference in the proportion of variance between the two approaches using the admixed family data from Baependi, we also compared the kinship values derived from the IBD method implemented in PLINK [25] with those from the REAP approach, which used the allele frequencies from the three ancestral populations [14]. We observed that the relatedness coefficients obtained from these two programs were similar (supplementary Figure S3). Furthermore, we compared the association results for the quantitative trait average systolic blood pressure, using the R function lmekin in the R library coxme with the kinship matrix implemented in the R library kinship2 and with the kinship matrix from the REAP program. We did so both including only age and sex as covariates (supplementary Figure S4 left) and also adding PCB1 and PCB2 as covariates (supplementary Figure S4 right). The results are almost identical, indicating that similar results are found whether the possible confounding due to population stratification is corrected through the fixed or the random effect. In addition, not including the ancestral populations in the estimation of the kinship values did not affect the association results in admixed families.
For the simulated data, we ran two different sets of simulations, with 100 and 3,000 SNPs, with the same family structure and a family size of 20 members. The results presented here used Fst = 0.2 only, because they are similar to those obtained with Fst = 0.05 (data not shown). The top of Figure 5 shows principal components 1 and 2 from Price’s and Oualkacha’s approaches for 100 SNPs. We do not observe any difference between the plots except for the different scales of the axes; however, we observe the same pattern of difference in the proportion of variance explained by the PCs as in the real data sets (Table 6 and supplementary Figure S5), i.e., the proportion explained by the first three principal components was 17.9% for Oualkacha’s approach compared to 9.14% for Price’s. When we increase the number of SNPs to 3,000, we observe that principal components 1 and 2 from Oualkacha’s approach better capture the population substructure by taking the family structure into account (Figure 5, bottom). For better visualization we selected 4 families at random and highlighted them in Figure 5. The results are similar to those from the Baependi data (Figure 4), i.e., the families were well separated using Oualkacha’s approach. The admixture proportions for each member of these four simulated families are shown in supplementary Figure S6.
Figure 5.

Distribution of some randomly chosen families in the principal components plot from Price’s and Oualkacha’s methods using standardized genotype data and simulated data from 2,000 subjects and 100 SNPs (top)/3,000 SNPs (bottom).
Table 6.
Proportion of variation explained by the first 10 PCs from Price and Oualkacha methods (standardized versions) for 100 families (2,000 subjects) and 100 and 3,000 SNPs.
| PCs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| SNPs | 100 | |||||||||
| Price | 0.058 | 0.017 | 0.0164 | 0.0163 | 0.0159 | 0.0157 | 0.0156 | 0.0151 | 0.0147 | 0.0145 |
| Oualkacha | 0.101 | 0.040 | 0.038 | 0.035 | 0.033 | 0.032 | 0.0305 | 0.029 | 0.028 | 0.014 |
| SNPs | 3,000 | |||||||||
| Price | 0.039 | 0.004 | 0.004 | 0.004 | 0.004 | 0.0035 | 0.0035 | 0.0035 | 0.0035 | 0.0034 |
| Oualkacha | 0.072 | 0.016 | 0.016 | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 0.0148 | 0.0147 |
All the analyses were performed using R version 3.0.2 [26]. The functions that calculate the familial PCs are available upon request from the authors.
Discussion
It is common practice in genome-wide association analysis to adjust for principal components to correct for population stratification; however, for cluster-correlated data such as families, the family structure is usually ignored, and the PCs are estimated by using the loadings of the founders [8] or by assuming all the subjects are unrelated [2]. This may work for homogeneous populations, but it was unclear whether the same applies to admixed populations. Admixture can be used as a tool for finding linked genes and also to detect differences from allelic association between loci [27]. It is also well known that population stratification in unrelated subjects can induce false-positive association results, due to linkage disequilibrium and to differences in allele frequency distributions between populations, in candidate genes [28, 29, 30], in a reasonably large number of SNPs [31], and in thousands of SNPs [2]. However, for clustered or family data it had never been investigated whether the family structure should be taken into account when calculating the PCs for admixed populations.
Several approaches were used to include the family structure in the estimation of the PCs in family data. The approach developed by Konishi and Rao [9] is restricted to nuclear families or sibships. Using standardized genotype data and sibships from GENOA [16], we showed that the PCs obtained with and without taking the family structure into account were similar except for a scaling difference; however, the proportion of variance explained by the first three PCs is much larger when the family structure is included than when it is ignored. With the approach proposed here for admixed families with more than two generations, using genotype data from the Baependi study [17] and simulated data, the PC plots that take the family structure into account were more informative about the admixture than the plots ignoring the family structure. Furthermore, the proportion of variance explained by the first three PCs was larger when the family structure was incorporated in the estimation of the PCs than when it was ignored. To investigate the effect of Price’s and Oualkacha’s PCs on the genome-wide association results, we performed three analyses: one with Price’s PCs, one with Oualkacha’s PCs, and one without PCs. We used the five phenotypes related to Metabolic Syndrome, adjusted for age, age², and sex. We observed that if no PCs were included, the significance level of the SNPs either decreased or increased depending on the phenotype; on the other hand, when PCs were included, Oualkacha’s PCs yielded less significant results than Price’s (data not shown).
We have also performed genome-wide association analyses using the Baependi data only and observed that by not including the ancestral populations in the estimation of the kinship values, the association results were not affected (supplementary Figure S4). Furthermore, the inclusion of the PCs in the association analysis using either of the two kinship values reduces the level of significance drastically (supplementary Figure S4).
Conclusion
The objective of this paper was to show that the family structure should be incorporated in the estimation of the PCs in admixed populations. The preconceived idea that it is not necessary to take the family structure into account when analyzing family data does not hold in general; in particular, it does not hold for extended families from admixed populations. We have used real and simulated datasets, and both have shown the relevance of the family structure in the estimation of the principal components and their role in the association analysis.
Supplementary Material
Figure S1: Plots of the proportion of variation explained by the PCs from Price’s and Konishi-Rao’s methods using standardized genotype data from the GENOA data.
Figure S2: Plots of the proportion of variation explained by the PCs from Price’s and Oualkacha’s methods using standardized genotype data from the Baependi data.
Figure S3: Relatedness plots showing the results of the IBD method implemented in PLINK (left) and in REAP (right) using the Baependi family data.
Figure S4: Q-Q plots comparing the genome-wide association results using the standard kinship matrix (y-axis) and the kinship matrix from REAP (x-axis) for the quantitative trait average systolic blood pressure from the Baependi family data. The analyses were adjusted for age, age², and sex using the lmekin R function. On the left is the Q-Q plot excluding PCB1 and PCB2, and on the right is the Q-Q plot including PCB1 and PCB2.
Figure S5: Plots showing the proportion of variation explained by the PCs from the Price et al. and Oualkacha et al. methods (standardized versions) for 2,000 subjects and 100 SNPs using simulated data.
Figure S6: Plot of the admixture proportions for the 4 randomly selected families in Figure 5, using the simulated data with 2,000 subjects. The black squares represent the founders of the respective families. The indices 1–20 on the x-axis represent the last two digits of the IID (individual ID) from the pedigree depicted in Figure 1.
Acknowledgments
We would like to thank DeLaine Anderson for her technical assistance with the manuscript and Dr. Timothy Thornton for sharing his R function to simulate admixed families. The project was partially supported by the NIH grant HL87660 (MdA) and a Biomedical Statistics & Informatics (BSI) Merit Award, Department of Health Sciences Research, Mayo Clinic (DR).
Footnotes
Supplementary Materials
In this supplement we include extra information that may be of interest to the readers, and as referenced in the paper.
References
- 1. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004. doi: 10.1111/j.0006-341x.1999.00997.x.
- 2. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006;38(8):904–909. doi: 10.1038/ng1847.
- 3. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2(12):e190. doi: 10.1371/journal.pgen.0020190.
- 4. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959. doi: 10.1093/genetics/155.2.945.
- 5. Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics. 2003;164:1567–1587. doi: 10.1093/genetics/164.4.1567.
- 6. Price AL, Tandon A, Patterson N, Barnes KC, Rafaels N, Ruczinski I, Beaty TH, Mathias R, Reich D, Myers S. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 2009;5(6):e1000519. doi: 10.1371/journal.pgen.1000519.
- 7. Brisbin A, Bryc K, Byrnes J, Zakharia F, Omberg L, Degenhardt J, Reynolds A, Ostrer H, Mezey JG, Bustamante CD. PCAdmix: Principal Components-Based Assignment of Ancestry along Each Chromosome in Individuals with Admixed Ancestry from Two or More Populations. Human Biology. 2012;84(4):343–364. doi: 10.3378/027.084.0401.
- 8. Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS. A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. Bioinformatics. 2012;28(24):3326–3328. doi: 10.1093/bioinformatics/bts606.
- 9. Konishi S, Rao CR. Principal components for multivariate familial data. Biometrika. 1992;79:631–641.
- 10. Ott J, Rabinowitz D. A principal-components approach based on heritability for combining phenotype information. Hum Hered. 1999;49:106–111. doi: 10.1159/000022854.
- 11. Wang Y, Fang Y, Jin M. A ridge penalized principal-components approach based on heritability for high-dimensional data. Hum Hered. 2007;64:182–191. doi: 10.1159/000102991.
- 12. Oualkacha K, Labbe A, Ciampi A, Roy MA, Maziade M. Principal components of heritability for high dimension quantitative traits and general pedigrees. Statistical Applications in Genetics and Molecular Biology. 2012. doi: 10.2202/1544-6115.1711.
- 13. Thornton T, McPeek MS. ROADTRIPS: Case-Control association testing with partially or completely unknown population and pedigree structure. Am J Hum Genet. 2010;86:172–184. doi: 10.1016/j.ajhg.2010.01.001.
- 14. Thornton T, Tang H, Hoffmann TJ, Ochs-Balcom HM, Caan BJ, Risch N. Estimating kinship in admixed populations. Am J Hum Genet. 2012;91:122–138. doi: 10.1016/j.ajhg.2012.05.024.
- 15. Multi-Center Genetic Study of Hypertension: The Family Blood Pressure Program (FBPP). Hypertension. 2002;39:3–9. doi: 10.1161/hy1201.100415.
- 16. Daniels PR, Kardia SL, Hanis CI, Brown CA, Hutchinson R, Boerwinkle E, Turner ST. Familial aggregation of hypertension treatment and control in the Genetic Epidemiology Network of Arteriopathy (GENOA) study. Am J Medicine. 2004;116:676–681. doi: 10.1016/j.amjmed.2003.12.032.
- 17. Oliveira C, Pereira AC, de Andrade M, Soler JM, Krieger JE. Heritability of cardiovascular risk factors in a Brazilian population: Baependi heart study. BMC Med Genet. 2008;9(32). doi: 10.1186/1471-2350-9-32.
- 18. Giolo SR, Soler JMP, Greenway SC, Almeida MAA, de Andrade M, Seidman JC, Seidman CE, Krieger JE, Pereira AC. Brazilian urban population genetic structure reveals a high degree of admixture. Eur J Human Genet. 2011;19:111–116. doi: 10.1038/ejhg.2011.144.
- 19. de Andrade M, Gueguen R, Visvikis S, Sass C, Siest G, Amos CI. Extension of variance components approach to incorporate temporal trends and longitudinal pedigree data analysis. Genetic Epidemiol. 2002;22:221–232. doi: 10.1002/gepi.01118.
- 20. Lange K. Mathematical and Statistical Methods for Genetic Analysis. 2. New York: Springer-Verlag; 2002. revised.
- 21. de Andrade M, Soler JMP. Multivariate Polygenic Mixed Model in Admixed Population. Revista da Estatistica UFPOP. 2014;3(2):200–2009.
- 22. Bilodeau M, Duchesne P. Principal component analysis from the multivariate familial correlation matrix. J Multivariate Analysis. 2002;82:457–470.
- 23. Mardia KV, Bibby JM, Kent JT. Multivariate analysis. London: Academic Press; 1979.
- 24. Balding DJ, Nichols RA. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica. 1995;96:3–12. doi: 10.1007/BF01441146.
- 25.Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Human Genet. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing; Vienna, Austria: 2014. URL http://www.R-project.org/ [Google Scholar]
- 27.Chakraborty R, Weiss KM. Admixture as a tool for finding linked genes and detecting that difference from allelic association between loci. Proc Natl Acad Sci USA. 1988;85:9119–9123. doi: 10.1073/pnas.85.23.9119. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Goddard KA, Hopkins PJ, Hall JM, Witte JS. Linkage disequilibrium and allele frequency distributions for 114 single-nucleotide polymorphisms in five populations. Am J Hum Genet. 2000;66:216–234. doi: 10.1086/302727. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Thomas DC, Witte JS. Point:Population stratification: A problem for case-control studies of candidate gene associations? Cancer Epidemiol Biomarkers Prev. 2002;11:505–512. [PubMed] [Google Scholar]
- 30.Cardon LR, Palmer LJ. Population stratification and spurious allelic association. Lancet. 2003;361:598–604. doi: 10.1016/S0140-6736(03)12520-2. [DOI] [PubMed] [Google Scholar]
- 31.Marchini J, Cardon LR, Phillips MS, Donnelly P. The effects of human population structure on large association studies. Nat Genet. 2004;36:512–517. doi: 10.1038/ng1337. [DOI] [PubMed] [Google Scholar]
Associated Data
Supplementary Materials
Figure S1: Plots of the proportion of variation explained by the PCs from Price’s and Konishi-Rao’s methods, using standardized genotype data from the GENOA data.
Figure S2: Plots of the proportion of variation explained by the PCs from Price’s and Oualkacha’s methods, using standardized genotype data from the Baependi data.
Figure S3: Relatedness plots showing the results of the IBD method implemented in PLINK (left) and in REAP (right) using the Baependi family data.
Figure S4: Q-Q plots comparing the genome-wide association results obtained with the standard kinship matrix (y-axis) and with the kinship matrix from REAP (x-axis) for the quantitative trait, average systolic blood pressure, from the Baependi family data. The analyses were adjusted for age, age², and sex using the lmekin R function. On the left is the Q-Q plot excluding PCB1 and PCB2, and on the right is the Q-Q plot including PCB1 and PCB2.
Figure S5: Plots showing the proportion of variation explained by the PCs from the Price et al. and Oualkacha et al. methods (standardized versions), using simulated data with 2,000 subjects and 100 SNPs.
Figure S6: Plot of the admixture proportion for the 4 randomly selected families in Figure 5, using simulated data for 2,000 subjects. The black squares represent the founders of the respective families. The indices 1–20 on the x-axis represent the last two digits of the IID (individual ID) from the pedigree depicted in Figure 1.
