Author manuscript; available in PMC: 2021 Feb 1.
Published in final edited form as: J Appl Genet. 2019 Nov 21;61(1):75–86. doi: 10.1007/s13353-019-00526-7

The impact of disregarding family structure on genome-wide association analysis of complex diseases in cohorts with simple pedigrees

Alireza Nazarian 1,*, Konstantin G Arbeev 1, Alexander M Kulminski 1,*
PMCID: PMC6980752  NIHMSID: NIHMS1544268  PMID: 31755004

Abstract

The generalized linear mixed models (GLMMs) methodology is the standard framework for genome-wide association studies (GWAS) of complex diseases in family-based cohorts. Fitting GLMMs in very large cohorts, however, can be computationally demanding. Also, the modified versions of GLMM using faster algorithms may underperform, for instance when a single nucleotide polymorphism (SNP) is correlated with fixed-effects covariates. We investigated the extent to which disregarding family structure may compromise GWAS in cohorts with simple pedigrees by contrasting logistic regression models (i.e., with no family structure) to three LMMs-based ones. Our analyses showed that the logistic regression models in general resulted in smaller P-values compared to the LMMs-based models; however, the differences in P-values were mostly minor. Disregarding family structure had little impact on determining disease-associated SNPs at the genome-wide level of significance (i.e., P < 5E-08), as the four P-values resulting from the tested methods for any SNP were all below or all above 5E-08. Nevertheless, larger discrepancies were detected between logistic regression and LMMs-based models at the suggestive level of significance (i.e., 5E-08 ≤ P < 5E-06). The SNP effects estimated by the logistic regression models were not statistically different from those estimated by GLMMs that implemented Wald’s test. However, several SNP effects were significantly different from their counterparts in LMMs analyses. We suggest that fitting GLMMs with Wald’s test on a pre-selected subset of SNPs obtained from logistic regression models can ensure the balance between the speed of analyses and the accuracy of parameters.

Keywords: Complex Diseases, Family-based GWAS, Logistic Regression, GLMMs Framework

Introduction

Genome-wide association studies (GWAS) are commonly used for exploring the genetic architecture of complex traits. The standard GWAS framework assumes that subjects in the case and control groups are unrelated (Aulchenko et al. 2010). Analyzing family-based cohorts, however, may provide additional benefits as, for example, they can be more robust to the confounding effects of population admixture and environmental factors (Evangelou et al. 2006), and be more powerful when the underlying genetic variants are rare or of small effect sizes, or the phenotype of interest is not common (Zondervan and Cardon 2007). The linkage disequilibrium (LD) pattern and allele frequency of trait-associated variants in family-based cohorts might represent some differences from the general population leading to biased estimates of genetic parameters obtained from the standard fixed-effects GWAS framework (McArdle et al. 2007).

To account for potential confounding effects of familial clustering several approaches have been suggested such as using a family-based allelic association test (e.g., transmission disequilibrium test (TDT) and its extensions) (Spielman et al. 1993; Gordon et al. 2004), including only a subset of unrelated subjects in the analyses (e.g., one subject per any closely related cluster of individuals) (Lee et al. 2008), and using the linear mixed models (LMMs) framework to model within-family correlations (Eu-ahsunthornwattana et al. 2014).

The LMMs-based genetic analyses have become popular in recent years due to their ability to account for familial clustering and cryptic relatedness as well as for the global population stratification (Price et al. 2010; Lloyd-Jones et al. 2018). The generalized linear mixed models (GLMMs) methodology that includes a family structure as a random-effects covariate is the gold standard framework for exploring the genetic architecture of dichotomous traits in family-based studies, particularly, when the study design is not appropriate for a TDT-based test (e.g., when the dataset consists of a mixture of singletons and families) (Manichaikul et al. 2012). The main drawback, however, is that fitting GLMMs on millions of single nucleotide polymorphisms (SNPs) can be computationally intensive, particularly in samples consisting of thousands of individuals as the computing time and memory usage increase non-linearly with the sample size. To overcome such computational burdens several modified variants of GLMM-based analyses with faster or two-step algorithms have been introduced (Aulchenko et al. 2010; Eu-ahsunthornwattana et al. 2014). Such two-step methods can considerably speed up the analyses at a potential expense of statistical power for example when SNPs were correlated with fixed-effects covariate(s) (Aulchenko et al. 2007, 2010; Kang et al. 2010). It has been also suggested that LMMs, as a first order Taylor’s approximations to GLMMs, can be used in GWAS of dichotomous traits although such models do not consider a correct link function for relating outcome and explanatory variables (Zhou et al. 2013).

A previous simulation study demonstrated that when family structure was excluded from the genetic analyses of complex traits, the estimated effect sizes of genetic variants and the statistical power of analyses were not significantly biased, although the type-I error rates increased slightly, by a magnitude that depended on the narrow-sense heritability (h2) of the trait and the degree of relatedness of subjects (McArdle et al. 2007). Still, it would be desirable to explicitly evaluate the impact of excluding family structure from the association analyses of complex diseases using real data, as simulation studies often require specific assumptions which may not hold in reality. The main objective of the present study was to investigate the extent to which disregarding family structure may compromise the results from GWAS of family-based cohorts with simple pedigree structures (i.e., samples with two to three generations of mostly small-size families), which are the most common familial data used for the genetic analyses of complex diseases. We performed GWA and gene-based analyses of Alzheimer’s disease (AD) with h2 = 58% - 74% (Gatz et al. 1997, 2006) and hypertension (HTN) with h2 = 30% - 61% (Kupper et al. 2005, 2006; Tang et al. 2006; Shih and O’Connor 2008; Vattikuti et al. 2012) in two independent cohorts to compare four alternative methods of interest, i.e., the logistic regression analysis with no family structure in the models as well as three LMMs-based methods including GLMMs with Wald’s test, GLMMs with score test, and LMMs.

Methods

Study Participants

The genetic architectures of AD and HTN were investigated in two independent datasets with family-based designs: 1) National Institute on Aging’s Late-Onset Alzheimer’s Disease Family Study (NIA-LOADFS) (Lee et al. 2008), and 2) Framingham Heart Study (FHS) (Dawber et al. 1951; Feinleib et al. 1975; Splansky et al. 2007). These two studies provided data for mixtures of singletons and two- or three-generational families, mostly of Caucasian ancestry. Therefore, only self-identified Caucasian subjects were included in our study, which would provide the largest sample sizes and the highest statistical power compared to other ethnicities and would minimize the population stratification.

Basic demographic and pedigree structure information about study participants from LOADFS and FHS can be found in Table S1. Around 26–29% of LOADFS subjects and 8–16% of FHS subjects that were included in our study were singletons. The others were members of 278–623 LOADFS families and 697–866 FHS families. The median sizes of families were between 3 and 5, with 6–10% of families in LOADFS and 9–22% of families in FHS having more than 10 members. Also, ~1% of FHS families consisted of more than 50 subjects. The parents-offspring and sibship (PS) relationships were found in 95–99% of LOADFS families and 83–97% of FHS families. Also, extended family (EF) relationships (e.g., grandparents, aunts, uncles, cousins) were found in 32–52% and 22–62% of families in LOADFS and FHS, respectively. The numbers of PS relationships exceeded those of EF relationships in 77–89% of LOADFS families and 65–74% of FHS families.

The AD cases in both LOADFS and FHS as well as the HTN cases in the LOADFS were directly identified by the study investigators. The HTN cases in the FHS were determined as those that had blood pressure ≥ 140/90 in three or more exams or had received anti-HTN medications. For the analysis of AD, subjects from the third generation of the FHS were excluded to make the two studies age-comparable. Also, the genetic analysis of HTN in the LOADFS was performed on a portion of subjects that had consented to be included in the cardiovascular diseases-related research.

GWA Analysis

Logistic regression models were fitted using PLINK package (v1.07) (Purcell et al. 2007) to estimate the SNPs additive effects on AD/HTN after adjustment for fixed-effects covariates including the birth year, sex, and top 5 principal components (PCs) of genotype data (Nazarian et al. 2018). The principal component analysis was performed by the GENESIS R package (Conomos et al. 2015) that can minimize the impact of known or cryptic relatedness of subjects in order to ensure that the calculated PCs mainly reflect the population stratification and are not biased by the family structure. Table S1 contains the numbers of SNPs analyzed by our GWA analyses.

The SNPs that had PPLINK < 0.05 were subject to the second step of our analyses in which the PLINK models were compared with three alternative LMMs-based methods. The LMMs-based analyses included family structure in the fitted models as a random-effects covariate along with all aforementioned fixed-effects covariates. The three methods under consideration were:

A). GLMM-W

GLMMs containing family-ID as a random-effects covariate were fitted and a Wald’s test was used to test if the SNPs effects were significant. Under this scenario, the full model including all fixed- and random-effects covariates was fitted for every SNP. These models were fitted by lme4 R package (Bates et al. 2015).
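For orientation, a random-intercept logistic mixed model of the kind described above can be sketched as follows. The notation is ours and not taken from the source: the indices i (subject) and j (family), the genotype dosage g, the covariate vector x with coefficients γ, and the family effect u with variance σ_u² are all assumptions for illustration, and the exact parameterization the authors used in lme4 is not confirmed here.

```latex
% Assumed notation: y_{ij} is the disease status of subject i in family j,
% g_{ij} the additive SNP genotype, x_{ij} the fixed-effects covariates
% (birth year, sex, top 5 PCs), and u_j a family-level random intercept.
\operatorname{logit}\,\Pr(y_{ij}=1 \mid u_j)
  = \beta_0 + \beta_{\mathrm{SNP}}\, g_{ij} + x_{ij}^{\top}\gamma + u_j,
  \qquad u_j \sim N(0,\sigma_u^2)

% Wald's test of H_0: \beta_{SNP} = 0, referred to a chi-square
% distribution with 1 degree of freedom:
W = \left(\frac{\hat{\beta}_{\mathrm{SNP}}}{\operatorname{se}(\hat{\beta}_{\mathrm{SNP}})}\right)^{2}
  \sim \chi^2_1
```

Because the full model is refitted for every SNP, the cost of this approach grows with both the number of SNPs and the sample size.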

B). GLMM-S

GLMMs were fitted with the family structure modeled by including a normalized additive genomic relationship matrix (GRM) (Nazarian and Gezan 2016) as a random-effects covariate. The GRM was calculated over all SNPs from autosomal chromosomes. These models were fitted using GMMAT R package (Chen et al. 2016), which implemented a two-step procedure in which a null logistic mixed model (i.e., not containing any SNPs) was fitted first to estimate the fixed and random effects and variance-covariance matrices. The estimated parameters from this step were then used for estimating the association signals of SNPs by a score test.

C). LMM-W

As with GLMM-S, the family structure was accounted for by including a normalized additive GRM as a random-effects covariate in the models, and a two-step procedure was implemented to estimate the SNPs effects. The difference was that here the phenotype data (i.e., AD and HTN status) were treated as quantitative rather than binary traits. Also, a Wald’s test was used instead of the score test. The LMMs were fitted in a leave one chromosome out (LOCO) manner, i.e., when the effects of SNPs located on a given chromosome were estimated, the genotype information for SNPs on that chromosome was excluded from the GRM. The GCTA package (v1.26.0) (Yang et al. 2011) was used to fit these models.

Gene-based analyses

The summary results from each of the GWA analyses were used to perform a gene-based analysis in which the z-statistics of a set of SNPs located within ±50 kb of a given gene were combined into a single test statistic for measuring the association of that gene with the diseases of interest. These analyses were performed using GCTA package (v1.26.0) (Yang et al. 2011). The SNPs corresponding to each gene were first pruned based on their pair-wise LD to remove one SNP from each pair that had r2 > 0.9. The compound test statistic was then generated as the quadratic form of the vector of z-statistics of the pruned SNP set (i.e., T = z’Iz, where z is the vector of z-statistics with a multivariate normal distribution and I is an identity matrix) and its distribution was estimated by an approximation method (e.g., Satterthwaite method) (Bakshi et al. 2016).
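As a worked illustration of the kind of approximation mentioned above, a Satterthwaite-type moment match for the quadratic form can be sketched as follows. It assumes that, under the null, z follows a multivariate normal distribution with mean zero and covariance R, where R is the LD correlation matrix of the m pruned SNPs (R and m are our notation). Whether GCTA uses exactly this parameterization is not confirmed by the source.

```latex
% Assumption: z \sim N(0, R) under the null, R = LD correlation matrix (m x m).
T = z^{\top} I z = \sum_{k=1}^{m} z_k^{2}

% First two moments of T:
\mathbb{E}[T] = \operatorname{tr}(R) = m,
\qquad
\operatorname{Var}[T] = 2\,\operatorname{tr}(R^{2})

% Match these to a scaled chi-square, T \approx a\,\chi^2_d:
a = \frac{\operatorname{tr}(R^{2})}{\operatorname{tr}(R)},
\qquad
d = \frac{\operatorname{tr}(R)^{2}}{\operatorname{tr}(R^{2})}

% Approximate gene-level P-value:
P \approx \Pr\!\left(\chi^2_d > T / a\right)
```

When the SNPs are uncorrelated (R = I), this reduces to a = 1 and d = m, i.e., an ordinary chi-square test with m degrees of freedom.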

Criteria for Comparing Methods

P-values

For each tested method, the log-transformed probability of the effect of any SNP per individual (i.e., ε = −log10(P-value)/N, where N is the sample size) was calculated as the efficiency measure of that method. The relative efficiency of PLINK to a given LMMs-based method (i.e., ρ = εPLINK / εLMMs-based) averaged over the investigated SNPs was used to assess the overall differences of the two methods (Kulminski et al. 2016). An average ρ > 1 indicates that the P-values from PLINK were in general smaller than those from an alternative method and vice versa. The larger the deviation of ρ from one is, the larger the difference of the P-values from the two methods is.
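The efficiency measure and the ρ ratio defined above can be sketched in a few lines of Python. The function names are hypothetical and chosen only for illustration; the sketch assumes both methods were run on the same N subjects and that no P-value equals exactly 0 or 1.

```python
import math


def efficiency(p_value, n):
    """Per-individual efficiency: epsilon = -log10(P-value) / N."""
    return -math.log10(p_value) / n


def relative_efficiency(p_plink, p_other, n):
    """rho = epsilon_PLINK / epsilon_other for a single SNP.

    Undefined when p_other == 1, because epsilon_other is then zero.
    """
    return efficiency(p_plink, n) / efficiency(p_other, n)


def mean_relative_efficiency(p_pairs, n):
    """Average rho over an iterable of (p_plink, p_other) pairs."""
    ratios = [relative_efficiency(p, q, n) for p, q in p_pairs]
    return sum(ratios) / len(ratios)
```

With a shared sample size, N cancels out of ρ, so the ratio reduces to log10(PPLINK) / log10(Pother); a value above 1 means the PLINK P-value is the smaller of the two.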

The impact of disregarding or modeling family structure on significant findings was also investigated for SNPs that had significant PPLINK at genome-wide (i.e., P < 5E-08) or suggestive (i.e., 5E-08 ≤ P < 5E-06) significance levels. For a given PLINK vs. LMMs-based comparison, we determined the proportion of SNPs with significant P-values in PLINK that had non-significant P-values in the LMMs-based method. Likewise, the proportion of SNPs with non-significant PPLINK that were significantly associated with diseases of interest in the LMMs-based method was identified.
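The two discordance proportions described above can be illustrated with a short Python sketch. The function name and the returned keys are hypothetical, and the sketch applies a single cutoff (e.g., 5E-08 or 5E-06) rather than a significance band, which is a simplifying assumption.

```python
def discordance(p_plink, p_other, threshold):
    """Compare significance calls from two methods at one P-value cutoff.

    Returns the share of SNPs significant in PLINK but not in the other
    method ("lost"), and the share non-significant in PLINK but
    significant in the other method ("gained"). A share is None when its
    denominator is zero.
    """
    pairs = list(zip(p_plink, p_other))
    sig = [(a, b) for a, b in pairs if a < threshold]
    nonsig = [(a, b) for a, b in pairs if a >= threshold]
    lost = sum(1 for _, b in sig if b >= threshold)
    gained = sum(1 for _, b in nonsig if b < threshold)
    return {
        "lost": lost / len(sig) if sig else None,
        "gained": gained / len(nonsig) if nonsig else None,
    }
```

The inputs are two equally long sequences of P-values for the same SNPs in the same order, one from the logistic regression and one from the LMMs-based method being compared.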

We also implemented genomic control (GC) for correcting the estimated test statistics from PLINK by the genomic inflation factor (i.e., λ value) from logistic regression models (Devlin and Roeder 1999). The resulting GC-corrected P-values (i.e., PPLINK-GC) were then compared to their respective LMMs-based P-values as explained above. Also, the same measures (i.e., ε values, ρ ratios, and the proportion of significant/non-significant findings in PLINK analyses that had non-significant/significant P-values in LMMs-based models) were used to compare the results of gene-based analyses.
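A minimal sketch of genomic control on 1-degree-of-freedom test statistics is given below. It assumes the P-values come from 1-df chi-square (or equivalent two-sided z) tests and estimates λ as the median observed statistic divided by the median of the chi-square distribution with 1 df (about 0.4549). The function names are hypothetical, and the authors' exact procedure in PLINK may differ in details, for example in which set of SNPs is used to estimate λ.

```python
import math
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def p_to_chi2(p):
    """Convert a two-sided P-value (0 < p <= 1) to a 1-df chi-square statistic."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile, stable for small p
    return z * z


def chi2_to_p(chi2):
    """Upper-tail probability of a 1-df chi-square statistic."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def genomic_control(p_values):
    """Return (lambda, GC-corrected P-values) for a list of P-values."""
    stats = [p_to_chi2(p) for p in p_values]
    lam = median(stats) / CHI2_1DF_MEDIAN
    corrected = [chi2_to_p(s / lam) for s in stats]
    return lam, corrected
```

Dividing every statistic by the same λ rescales all P-values uniformly, which is consistent with the observation in the text that such a global correction may be insufficient for individual SNPs with very small P-values.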

SNPs Effects (i.e., β coefficients)

The effect of each SNP estimated by PLINK was compared to its counterparts from the LMMs-based methods using a Wald’s chi-square test with 1 degree of freedom (Allison 1999) in order to investigate if the two estimated βs were significantly different at P < 0.05:

$$\chi^2 = \frac{(\beta_{\mathrm{PLINK}} - \beta_{X})^2}{se_{\mathrm{PLINK}}^2 + se_{X}^2},$$

where βPLINK and βX are the SNPs effects (i.e., the natural logarithm of odds ratios) estimated by PLINK and a given LMMs-based method, respectively, and sePLINK and seX are their corresponding standard errors. The number of significantly different findings for any comparison was considered as a measure of the methods’ differences at the SNPs effects level. Since dichotomous traits were treated as continuous phenotypes when LMMs were fitted (i.e., a Gaussian instead of a Logit link function), the SNPs effects estimated by the LMM-W method on the observed scale were transformed to odds ratios under an additive risk model (Lloyd-Jones et al. 2018) before performing the aforementioned comparisons.
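The comparison of two effect estimates can be sketched in Python as follows. The statistic is the squared difference of the two βs divided by the sum of their squared standard errors, referred to a chi-square distribution with 1 degree of freedom. The function name is hypothetical, and the example values in the usage note are made up for illustration only.

```python
import math


def wald_beta_difference(beta_a, se_a, beta_b, se_b):
    """Wald chi-square test (1 df) for the difference of two estimates.

    Returns (chi2, p_value), with
    chi2 = (beta_a - beta_b)^2 / (se_a^2 + se_b^2).
    """
    chi2 = (beta_a - beta_b) ** 2 / (se_a ** 2 + se_b ** 2)
    # Upper tail of a 1-df chi-square via the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For instance, two log odds ratios of 0.5 and 0.3, each with a standard error of 0.1, give a chi-square of 2.0 and a P-value of about 0.157, so they would not be called significantly different at P < 0.05.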

Results

Direction of changes of P-values

Our GWA analyses revealed that 107676 (LOADFS-AD), 92523 (FHS-AD), 124214 (LOADFS-HTN), and 112626 (FHS-HTN) SNPs had PPLINK < 0.05. These SNPs were subject to the comparative analyses of the four methods under consideration. Figure 1 shows the distributions of P-values of these SNPs across tested methods. As expected, disregarding family structure in PLINK models in general led to smaller P-values compared to the LMMs-based analyses (i.e., GLMM-W, GLMM-S, and LMM-W models) that included the family structure in the models (Tables S2–S5). Overall, 20–37% of SNPs had smaller P-values in the GLMM-W models compared to PLINK, although PGLMM-W was always larger than PPLINK for the SNPs with PPLINK < 5E-08. Also, 2–7% and 17–28% of SNPs had smaller P-values from the GLMM-S and LMM-W models, respectively, compared to the PLINK models. However, all SNPs with PPLINK < 5E-32 had smaller PGLMM-S and PLMM-W than PPLINK.

Figure 1:

Distributions of P-values of SNPs across tested methods

Magnitude of changes of P-values

The results of comparing P-values of SNPs among tested methods are summarized in Tables S6–S9. Also, Figure 2 plots the average ρ ratios corresponding to the pair-wise comparison of P-values of SNPs between methods of interest at different significance levels. The P-values from GLMM-W were slightly larger than those from PLINK, with overall ρ ratios of 1.027 to 1.126 across tested cohort/disease scenarios. Their differences, however, were more prominent for SNPs that had very small PPLINK (e.g., ρ = 1.41 at significance level of 5E-32). Figure S1 depicts the number of SNPs with ρ < 1 or ρ ≥ 1 when their PPLINK or PPLINK-GC were compared to the LMMs-based methods.

Figure 2:

The average ρ ratios corresponding to the pair-wise comparison of P-values of SNPs between methods of interest at different significance levels (grouped based on their PPLINK)

The differences between the P-values from the PLINK and those from the GLMM-S and LMM-W models were more noticeable, with average ρ ratios of 1.067–1.163 and 1.148–1.227 for the two comparisons, respectively. In general, the ρ ratios were greater than 1 except for SNPs with PPLINK < 5E-16, for which the PLINK mostly resulted in larger P-values than the GLMM-S and LMM-W. For these SNPs the LMM-W resulted in the smallest P-values, with average ρ ratios of 0.862–0.973 (Tables S6–S7 and Figure S1). Correcting PPLINK by the genomic inflation factor resulted in P-values that in general were larger than PGLMM-W (54–77% of SNPs had PPLINK-GC > PGLMM-W across different test scenarios, with maximum overall ρ ratio of 1.004). However, these adjustments were not able to properly correct the P-values of SNPs with very small P-values in PLINK models (e.g., PPLINK < 5E-16), which mostly had PPLINK-GC < PGLMM-W with average ρ ratios of 1.05–1.30. The opposite pattern was observed when PLINK-GC was compared to the GLMM-S and LMM-W methods, as PPLINK-GC of most SNPs were smaller than those obtained from the GLMM-S and LMM-W models (i.e., 45–62% and 58–61% of SNPs, respectively, with the maximum overall ρ ratios of 1.029 and 1.085 for the two comparisons). Also, for SNPs with PPLINK < 5E-16, PPLINK-GC were always greater than PGLMM-S and PLMM-W, with average ρ ratios between 0.795 and 0.949 (Tables S6–S7). When SNPs were grouped by their effect allele frequencies (EAF), no considerable differences were observed in ρ ratios for SNPs with common (EAF ≥ 5%) vs. rare (EAF < 5%) alleles (Tables S8–S9).

Significant findings from PLINK

The numbers (and proportions) of SNPs that had significant PPLINK at genome-wide (i.e., P < 5E-08) or suggestive (i.e., 5E-08 ≤ P < 5E-06) significance levels but were not significantly associated with the diseases under consideration in LMMs-based models can be found in Table S10. Figure 3 shows the distributions of P-values of these SNPs obtained from different methods. Also, the distributions of P-values of the SNPs that did not attain significant PPLINK at genome-wide or suggestive levels of significance are displayed in Figures S2 and S3, and information regarding the numbers (and proportions) of subsets of these SNPs that were significantly associated with AD and HTN by LMMs-based methods can be found in Table S11.

Figure 3:

Distributions of P-values of SNPs that had significant PPLINK at genome-wide (i.e., P < 5E-08) or suggestive (i.e., 5E-08 ≤ P < 5E-06) levels of significance

At the genome-wide significance threshold, all SNPs that attained PPLINK < 5E-08 had PGLMM-W, PGLMM-S, and PLMM-W < 5E-08 as well and vice versa. Also, when GC-adjusted P-values were compared, all SNPs with significant PPLINK-GC had significant P-values from the LMMs-based methods. Nevertheless, 1 SNP (0.001%) that had PPLINK-GC > 5E-08 attained P < 5E-08 in the three LMMs-based methods.

At suggestive level of significance more discrepancies were observed between PLINK and LMMs-based methods. For example, 39–73% of SNPs that were significantly associated with AD and HTN in PLINK analyses (i.e., PPLINK < 5E-06) had PGLMM-W ≥ 5E-06 across different test scenarios (i.e., overall 60 out of 99 SNPs). The PLINK vs. GLMM-S and LMM-W comparisons revealed slightly more discrepancies with the rates of 35–91% (i.e., 68 and 69 out of 99 SNPs with significant PPLINK had PGLMM-S and PLMM-W ≥ 5E-06, respectively). When PLINK-GC was compared to GLMM-W, GLMM-S, and LMM-W, the number of such findings decreased. We found that among 24 SNPs that had PPLINK-GC < 5E-06, there were 2, 2, and 5 SNPs that had non-significant P-values in the three LMMs-based methods, respectively.

In addition, our analyses revealed that less than 0.009% of SNPs that did not attain the suggestive level of significance in PLINK models were associated with the diseases of interest in other models (i.e., 8, 2, and 19 SNPs with PPLINK ≥ 5E-06 had PGLMM-W, PGLMM-S, and PLMM-W < 5E-06, respectively). Also, when PLINK-GC was compared with GLMM-W, GLMM-S, and LMM-W, there were 26, 12, and 31 SNPs, respectively, that had non-significant PPLINK-GC but P-values smaller than 5E-06 in these methods.

SNPs Effects

As expected, the β coefficients from the GLMM-W models had the same signs (i.e., the same direction of effects) as those from the PLINK models. The differences between the estimated βPLINK and βGLMM-W were smaller than 0.1 for more than 99% of SNPs under tested cohort/disease scenarios, which is equivalent to relative odds ratios (i.e., exp(βPLINK) / exp(βLMMs-based)) of around 0.9 to 1.1. The β-differences were more prominent for the SNPs that had very small PPLINK. For instance, almost all SNPs with PPLINK < 5E-32 had β-differences ≥ 0.1. However, none of the detected differences were statistically significant at P < 0.05 when a Wald’s chi-square test was used to compare βPLINK with βGLMM-W (Tables S12–S15). Also, all βLMM-W had the same sign as βPLINK, but the β-differences of 2–22% of SNPs were ≥ 0.1. When the β coefficients from the two methods were compared by a Wald’s chi-square test, in general less than 1% of SNPs had significantly different βPLINK and βLMM-W at P < 0.05. However, the percentage of SNPs that had β-differences ≥ 0.1 was higher for SNPs that had smaller PPLINK. For example, at significance level of 5E-08, all SNPs had β-differences ≥ 0.1 and the Wald’s chi-square testing revealed that βPLINK and βLMM-W of almost all of them were significantly different (Tables S12–S15).

The GLMM-S which used a score test to measure the significance of disease-SNPs associations provided score coefficients instead of the effect sizes (β) which could only show the direction of effects and had different interpretations than the SNP effects estimated by the PLINK, GLMM-W, and LMM-W methods. However, as expected, for all the tested SNPs the signs of score coefficients were the same as the signs of βs from the PLINK.

Gene-based analyses

The results of comparing P-values of genes among tested methods are summarized in Tables S16–S19. Also, Figure S4 shows the distributions of P-values of genes with PPLINK < 0.05 across tested methods. Line plots of the average ρ ratios for the pair-wise comparisons of P-values of genes between methods of interest are shown in Figure S5. As expected, the gene-based analyses using GWA results from the PLINK models in general produced smaller P-values than their counterpart LMMs-based analyses. At gene-based PPLINK < 0.05, up to 12%, 3%, and 22% of genes had smaller P-values from the GLMM-W, GLMM-S, and LMM-W analyses, respectively, compared to the PLINK, and the maximum ρ ratios were 1.244, 1.306, and 2.434. At gene-based PPLINK < 5E-06, the ρ ratios were 1.071–1.204, 0.997–1.321, and 0.939–1.366, respectively, for these comparisons. However, at gene-based PPLINK < 5E-32, the ρ ratios were 1.334, 0.953, and 0.879, indicating that the GLMM-W was still more conservative than PLINK (i.e., no gene had PPLINK > PGLMM-W), whereas the GLMM-S and LMM-W were more liberal than PLINK (i.e., all genes had PPLINK > PGLMM-S and PLMM-W). Among genes that had PPLINK < 0.05, 54–86% had PGLMM-W < PPLINK-GC (overall ρ ratios between 0.950 and 0.997). However, PGLMM-W of all genes that had PPLINK < 5E-32 were greater than their PPLINK-GC (ρ = 1.229). On the other hand, the GLMM-S and LMM-W methods mostly resulted in larger P-values than PLINK-GC at PPLINK < 0.05 (34–57% and 41–43% of genes had PGLMM-S < PPLINK-GC and PLMM-W < PPLINK-GC, respectively), with ρ ratios of 0.995–1.048 and 1.081–2.209, respectively. At gene-based PPLINK < 5E-32, all genes had smaller P-values from the GLMM-S and LMM-W methods compared to the PLINK-GC (ρ = 0.878 and 0.81, respectively) (Tables S16–S19).

The results of comparative analysis of the four methods of interest for genes that had significant and non-significant PPLINK at significance levels of 5E-08 and 5E-06 can be found in Tables S20–S21, respectively. Also, Figure S6 shows the distributions of P-values of genes that had significant PPLINK at these significance levels. The distributions of P-values of genes that had non-significant PPLINK at these significance levels can be seen in Figures S7 and S8.

At significance level of 5E-08, P-values resulting from PLINK and LMMs-based tested methods for any genes were all smaller than 5E-08 or all larger than 5E-08. The same was observed when contrasting PLINK-GC with LMMs-based results.

At significance level of 5E-06, all genes with non-significant PPLINK had non-significant P-values in LMMs-based analyses as well. On the other hand, several genes that were significantly associated with AD or HTN in PLINK analyses (i.e., 5E-08 ≤ PPLINK < 5E-06) had non-significant LMMs-based P-values. The largest discrepancy was observed between the PLINK and LMM-W methods, where 8 out of 9 genes with significant PPLINK had PLMM-W > 5E-06. Only 2 and 3 of these 9 genes had PGLMM-W and PGLMM-S > 5E-06, respectively. When PLINK-GC was compared with LMMs-based analyses at significance level of 5E-06, all 5 genes that had significant PPLINK-GC had P-values > 5E-06 in LMM-W analyses (but had significant PGLMM-W and PGLMM-S as well). Also, among genes that had non-significant PPLINK-GC, 2, 1, and 1 genes (up to 0.081–0.163% of genes) had P < 5E-06 from the GLMM-W, GLMM-S, and LMM-W methods, respectively.

Discussion

In this study, we primarily aimed to investigate to what extent the SNPs-disease association results from logistic regression analyses disregarding the family structure would be different from GLMMs that fit a full model that contains all fixed- and random-effects covariates for every SNP under consideration (i.e., PLINK vs. GLMM-W methods). However, since fitting genome-wide GLMM-W in very large cohorts can be computationally demanding, we additionally compared the performance of the PLINK analyses with two alternative LMMs-based methods that implemented two-step algorithms to fit models of interest (i.e., GLMM-S and LMM-W). Such two-step variants were introduced to decrease the computational burdens of genome-wide GLMM analyses (Aulchenko et al. 2010; Eu-ahsunthornwattana et al. 2014), although they were suggested to suffer from the loss of statistical power, particularly when SNPs are correlated with one or more fixed-effects covariates or when SNPs have large effects (Aulchenko et al. 2007, 2010; Kang et al. 2010).

We found that the PLINK models in general resulted in smaller P-values compared to the GLMM-W analyses; however, the differences between these two methods measured by the average ρ ratios were in most cases minor. Larger differences in P-values from these two methods were observed for SNPs with very small PPLINK (e.g., P < 5E-16), which may indicate that familial clustering had more impact on these loci. The GC-corrected P-values, on the other hand, were mostly larger than PGLMM-W, although almost all SNPs with very small PPLINK had smaller PPLINK-GC than PGLMM-W, indicating the insufficiency of GC-adjustment for these SNPs. In fact, while the GC-adjustment can efficiently control for the global population structure, it may not sufficiently address the subtle population structure at local genomic regions (Price et al. 2010; Qin et al. 2010).

Comparing PLINK and GLMM-S, we found that the overall average ρ ratios were slightly larger than those observed in PLINK vs. GLMM-W comparison. We also found that PLINK vs. LMM-W had the largest overall average ρ ratios implying that PLMM-W were mostly larger than PPLINK and their differences in general were more prominent than those from the other pair-wise comparisons of tested methods.

However, in the case of SNPs with very small PPLINK (e.g., P < 5E-16), PLMM-W were mostly smaller than the P-values from the other methods. Similarly, GLMM-S also resulted in P-values that were smaller than those from the PLINK and PLINK-GC for these SNPs. This can be indicative of the poor performance of the GLMM-S and LMM-W in correcting for family structure for such SNPs, particularly given that GLMM-W resulted in larger P-values for these SNPs compared to PLINK.

Our analyses showed that disregarding family structure had a small impact on determining disease-associated SNPs at the genome-wide level of significance (i.e., the four P-values resulting from PLINK and LMMs-based models for a given SNP were all below or all above 5E-08). The same was observed when GC-corrected P-values were compared to those obtained from LMMs-based methods (Tables S10–S11). However, at the suggestive level of significance (i.e., 5E-08 ≤ P < 5E-06), almost two-thirds of SNPs with significant PPLINK had P-values above 5E-06 in LMMs-based methods. When the GC-corrected P-values were compared to their counterparts from the LMMs-based methods, as expected, the numbers of SNPs with PPLINK-GC < 5E-06 and non-significant LMMs-based P-values decreased notably to around 8–21%. In general, such discrepancy rates were slightly higher in the PLINK/PLINK-GC vs. LMM-W comparison than other comparisons, and were slightly lower in the PLINK/PLINK-GC vs. GLMM-W comparison than other comparisons. There were also several SNPs that had P-values < 5E-06 in LMMs-based models whereas their PPLINK were above 5E-06. The number of such SNPs was higher in the PLINK-GC vs. LMMs-based comparisons than in the PLINK vs. LMMs-based comparisons. In general, the highest discrepancy rates were in the PLINK/PLINK-GC vs. LMM-W comparison, and the lowest in the PLINK/PLINK-GC vs. GLMM-S comparison (Tables S10–S11).

In terms of the SNPs effects, none of the SNPs had significantly different βPLINK and βGLMM-W, and the β differences of the two methods were smaller than 0.1 for more than 99% of SNPs. However, there were several SNPs with significantly different βPLINK and βLMM-W, particularly among SNPs with very small P-values, whose β differences were almost always greater than 0.1. This might be a disadvantage for the LMM-W method as it was previously suggested that disregarding family structure would not significantly impact the estimated β coefficients (McArdle et al. 2007). The utility of the score coefficients resulting from the GLMM-S method was limited, as such scores have a different interpretation than β coefficients estimated by the other methods. For example, they would only show the direction of effects and not the effect sizes. Also, such score coefficients may only be used in meta-analyses that combine scores from multiple studies, but cannot be combined with summary results from studies that estimated β coefficients.

In summary, this study demonstrated that the logistic regression models in general resulted in smaller P-values compared to LMMs-based analyses. Nevertheless, the differences in the estimates between these models were small in most cases. The most prominent differences were observed for SNPs with very small logistic regression P-values. For example, given the ρ ratios for the PLINK vs. GLMM-W, GLMM-S, and LMM-W comparisons for SNPs with PPLINK < 5E-32 (i.e., ρ = 1.409, 0.938, and 0.862, respectively (Table S6)), a SNP with PPLINK of 1E-40 would have PGLMM-W, PGLMM-S, and PLMM-W of around 1E-28, 1E-43, and 1E-46, respectively. Obtaining more accurate estimates of P-values for such SNPs, although not changing the final conclusions regarding the discovered SNP-disease associations, would be important for more accurate comparisons of the results across different studies. We also found that while the small differences in P-values from PLINK and LMMs-based methods had little impact on determining disease-associated SNPs at the genome-wide significance level, the differences were more likely to affect association calling at the suggestive level of significance.
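
The arithmetic behind the ρ example in the preceding paragraph can be reproduced with a short sketch. It assumes that ρ is the ratio of log10(PPLINK) to log10 of the P-value from the compared method. This definition is inferred from the quoted numbers and is not restated in this section. The function name is hypothetical.

```python
def implied_log10_p(log10_p_plink, rho):
    """Return log10 of the P-value implied for the compared method.

    Assumes rho = log10(P_PLINK) / log10(P_other); this is inferred from
    the worked numbers in the text, not taken from the Methods.
    """
    return log10_p_plink / rho

# rho ratios quoted in the text (Table S6) for SNPs with P_PLINK < 5E-32
rho_by_method = {"GLMM-W": 1.409, "GLMM-S": 0.938, "LMM-W": 0.862}

# A SNP with P_PLINK = 1E-40, i.e. log10(P_PLINK) = -40
for method, rho in rho_by_method.items():
    exponent = implied_log10_p(-40.0, rho)
    print(f"{method}: P is about 10^{exponent:.1f}")
```

Rounding the resulting exponents to whole numbers recovers the values of around 28, 43, and 46 orders of magnitude given in the text.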

We suggest that fitting GLMMs using Wald’s test (GLMM-W) may be the preferred method compared to LMM-W, as it uses the correct link function between the outcome and the explanatory variables and would likely provide more accurate estimates of β coefficients. GLMM-W also outperforms GLMM-S by providing β coefficients rather than score coefficients. However, a major drawback of performing a genome-wide GLMM-W analysis can be its relatively long computational time. The GWA analyses through the PLINK, GLMM-S, and LMM-W methods can readily be accomplished in several hours even when the analyses are not running on a multithreaded system. For example, using only one processor (Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz) on a UNIX operating system, it would take up to ~14.25 h (PLINK), ~31.5 h (GLMM-S), and ~15.25 h (LMM-W) to perform GWAS of AD/HTN in the FHS and LOADFS datasets containing 1561413 to 1832054 SNPs (Table S1). However, the GLMM-W analyses were considerably slower, as fitting genome-wide GLMM-W models using a single processor could take more than 125 (LOADFS-AD, n=3716), 41 (LOADFS-HTN, n=1966), 119 (FHS-AD, n=4409), and 155 (FHS-HTN, n=8108) days. Therefore, since the computational time is a function of the sample size and the number of SNPs, performing a genome-wide GLMM-W analysis can be practically infeasible in very large cohorts (e.g., mega-analyses of tens of thousands of subjects) with several millions of genotyped and imputed SNPs unless a high-performance computing system is available. To bypass the computational burden, the GLMM-W can be fitted on only a subset of SNPs that had P-values smaller than a pre-specified significance threshold in the logistic regression analyses. Since the P-values from GLMM-W can in some cases be smaller than the ones from the logistic regression models, a slightly more liberal significance threshold than the target significance threshold can be considered for pre-selecting the subsets of SNPs.
For instance, if association signals at a significance level of 5E-06 are of interest, it would be wise to pre-select SNPs that had PPLINK < 5E-05 or 5E-04 for GLMM-W analyses. The results from these GLMM-W analyses along with GC-corrected P-values for the remaining SNPs can then safely be used for downstream gene-based analyses, as our analyses demonstrated that PPLINK-GC were larger than PGLMM-W for most of the SNPs with non-significant PPLINK at a significance level of 5E-06, with average ρ ratios of 0.951 to 1.004. Alternatively, the GC-corrected PLINK P-values of the entire set of SNPs may also be used for downstream gene-based analyses, as our results showed that PPLINK-GC were larger than PGLMM-W for most of the genes, with overall ρ ratios of 0.950 to 0.997 (Tables S16–S19). Also, contrasting gene-based results from PLINK-GC and GLMM-W revealed that all disease-associated genes in PLINK-GC analyses at a significance level of 5E-06 were associated with the diseases under consideration in GLMM-W analyses as well (Tables S20–S21).
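
The pre-selection step suggested above can be sketched as a simple filter on logistic regression P-values. The sketch only shows how a liberal cut-off (10 or 100 times the target threshold of 5E-06, as in the example in the text) defines the subset passed on to GLMM-W. It does not fit any model. The function name, parameter names, SNP identifiers, and P-values are hypothetical.

```python
def preselect_snps(p_plink, target_alpha=5e-6, relax_factor=10.0):
    """Return SNP IDs whose logistic-regression P-value passes a liberal cut-off.

    The cut-off is target_alpha * relax_factor, e.g. 5E-06 * 10 = 5E-05.
    """
    cutoff = target_alpha * relax_factor
    return [snp for snp, p in p_plink.items() if p < cutoff]

# Hypothetical P-values for illustration only (not results from the study)
p_plink = {"snp_a": 3e-9, "snp_b": 2e-5, "snp_c": 4e-4, "snp_d": 0.2}

stage2_10x = preselect_snps(p_plink, relax_factor=10.0)    # cut-off 5E-05
stage2_100x = preselect_snps(p_plink, relax_factor=100.0)  # cut-off 5E-04

print(stage2_10x)
print(stage2_100x)
```

A larger `relax_factor` lowers the chance of missing a SNP whose GLMM-W P-value is smaller than its logistic regression P-value, at the cost of fitting more GLMM-W models.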

The focus of our study was on the genetic analyses of complex diseases with moderate to high h² in cohorts with simple pedigrees (i.e., cohorts consisting of 2 to 3 generations of mostly small-size families), which are the most commonly used familial data in human genetic studies. The two datasets analyzed here provided genotype and phenotype information for mixtures of singletons (26–29% in LOADFS and 8–16% in FHS) and families (i.e., 623 (LOADFS-AD), 278 (LOADFS-HTN), 697 (FHS-AD), and 866 (FHS-HTN) families). Around 78–94% of these families had fewer than 10 members and the median family sizes were 3–5. The most common family relationships found in both LOADFS and FHS families were parent-offspring and sibship relationships. These were the only relationships present in 46–67% of LOADFS families and 36–63% of FHS families. Also, 32–49% of LOADFS families and 20–61% of FHS families had extended family relationships in addition to the parental/sibship relationships. Less than 2.5% of families had only extended family relationships (Table S1). It was previously suggested that the type-I error rates could slightly increase (e.g., 0.05–0.07, 0.06–0.1, and 0.06–0.105 for simulated sib-pair, nuclear, and large full-sib families, respectively) when family structure was disregarded in the genetic analyses of traits with h² of 0.1 to 0.9. A similar pattern was also observed when data consisted of a mixture of parent-offspring, sibship, and extended family relationships (i.e., type-I error rates of 0.06–0.11 depending on the h² of the trait) (McArdle et al. 2007). Therefore, since the LOADFS and FHS datasets were mainly composed of mixtures of singletons and nuclear, full-sib, and small extended families, it is expected that the conclusions of our study would be generalizable to cohorts with simpler pedigree structure (e.g., those containing only nuclear or full-sib families).
However, further analyses are needed to examine the generalizability of our findings to more complex cohorts that mainly consist of large numbers of large-size extended and multi-generational families. Such datasets are not commonly available for the study of human complex diseases, although they are frequently used in other species. Analyzing a simulated dataset of 25 extended families of size 20 including grandparents, parents (3 full-sibs and their spouses), and children (4 per each sib-spouse pair), McArdle et al. (2007) suggested that the type-I error rates were between 0.06 and 0.15 for traits with h² of 0.1 to 0.9 when family structure was ignored. Although not tested here, we expect our conclusions would be valid in the case of analyzing outcomes with repeated measures in longitudinal cohorts, as such additional data are not expected to dramatically impact the estimated genetic effects. Instead, they would mainly increase the power of analyses to obtain more precise estimates of the genetic parameters of interest.

Supplementary Material

13353_2019_526_MOESM1_ESM

Acknowledgement

This manuscript was prepared using limited access datasets that are available through dbGaP repository (https://www.ncbi.nlm.nih.gov/gap) for qualified researchers (accession numbers: phs000168.v2.p2 (LOADFS) and phs000007.v28.p10 (FHS)).

Funding support for the Late Onset Alzheimer’s Disease Family Study (LOADFS) was provided through the Division of Neuroscience, NIA. The LOADFS includes a genome-wide association study funded as part of the Division of Neuroscience, NIA. Assistance with phenotype harmonization and genotype cleaning, as well as with general study coordination, was provided by Genetic Consortium for Late Onset Alzheimer’s Disease. This manuscript was not prepared in collaboration with LOADFS investigators and does not necessarily reflect the opinions or views of LOADFS.

The Framingham Heart Study (FHS) is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195 and HHSN268201500001I). This manuscript was not prepared in collaboration with investigators of the FHS and does not necessarily reflect the opinions or views of the FHS, Boston University, or NHLBI. Funding for SHARe Affymetrix genotyping was provided by NHLBI Contract N02-HL-64278. SHARe Illumina genotyping was provided under an agreement between Illumina and Boston University. Funding for CARe genotyping was provided by NHLBI Contract N01-HC-65226. Funding support for the Framingham Dementia dataset was provided by NIH/NIA grant R01 AG08122. Funding support for the Framingham Inflammatory Markers was provided by NIH grants R01 HL064753, R01 HL076784 and R01 AG028321.

Funding support for the Framingham C-reactive protein dataset was provided by NIH grants R01 HL064753, R01 HL076784 and R01 AG028321. Funding support for the Framingham Adiponectin dataset was provided by NIH/NHLBI grant R01-DK-080739. Funding support for the Framingham Interleukin-6 dataset was provided by NIH grants R01 HL064753, R01 HL076784 and R01 AG028321.

Funding

This research was supported by Grants from the National Institute on Aging (P01AG043352 and R01AG047310). The funders had no role in study design, data collection and analysis, decision to publish, or manuscript preparation. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Footnotes

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval and consent to participate

This study focuses on secondary analysis of data obtained from dbGaP upon approval by local Institutional Review Board (IRB), and does not involve gathering data from human subjects directly. All procedures performed were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Supplementary information

Additional File 1 containing Tables S1–S21 and Figures S1–S8 is provided as on-line supplementary information.

Publisher's Disclaimer: This Author Accepted Manuscript is a PDF file of an unedited peer-reviewed manuscript that has been accepted for publication but has not been copyedited or corrected. The official version of record that is published in the journal is kept up to date and so may therefore differ from this version.

References

  1. Allison PD (1999) Comparing logit and probit coefficients across groups. Sociol Methods Res 28:186–208. doi: 10.1177/0049124199028002003
  2. Aulchenko YS, de Koning D-J, Haley C (2007) Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics 177:577–585. doi: 10.1534/genetics.107.075614
  3. Aulchenko YS, Struchalin MV, van Duijn CM (2010) ProbABEL package for genome-wide association analysis of imputed data. BMC Bioinformatics 11:134. doi: 10.1186/1471-2105-11-134
  4. Bakshi A, Zhu Z, Vinkhuyzen AAE, et al. (2016) Fast set-based association analysis using summary data from GWAS identifies novel gene loci for human complex traits. Sci Rep 6:32894. doi: 10.1038/srep32894
  5. Bates D, Mächler M, Bolker B, Walker S (2015) Fitting linear mixed-effects models using lme4. J Stat Softw 67:1–48. doi: 10.18637/jss.v067.i01
  6. Chen H, Wang C, Conomos MP, et al. (2016) Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am J Hum Genet 98:653–666. doi: 10.1016/j.ajhg.2016.02.012
  7. Conomos MP, Miller MB, Thornton TA (2015) Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness. Genet Epidemiol 39:276–293. doi: 10.1002/gepi.21896
  8. Dawber TR, Meadors GF, Moore FE (1951) Epidemiological approaches to heart disease: the Framingham study. Am J Public Health Nations Health 41:279–286
  9. Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55:997–1004
  10. Eu-ahsunthornwattana J, Miller EN, Fakiola M, et al. (2014) Comparison of methods to account for relatedness in genome-wide association studies with family-based data. PLOS Genet 10:e1004445. doi: 10.1371/journal.pgen.1004445
  11. Evangelou E, Trikalinos TA, Salanti G, Ioannidis JPA (2006) Family-based versus unrelated case-control designs for genetic associations. PLOS Genet 2:e123. doi: 10.1371/journal.pgen.0020123
  12. Feinleib M, Kannel WB, Garrison RJ, et al. (1975) The Framingham offspring study: design and preliminary data. Prev Med 4:518–525
  13. Gatz M, Pedersen NL, Berg S, et al. (1997) Heritability for Alzheimer’s disease: the study of dementia in Swedish twins. J Gerontol A Biol Sci Med Sci 52:M117–125
  14. Gatz M, Reynolds CA, Fratiglioni L, et al. (2006) Role of genes and environments for explaining Alzheimer disease. Arch Gen Psychiatry 63:168–174. doi: 10.1001/archpsyc.63.2.168
  15. Gordon D, Haynes C, Johnnidis C, et al. (2004) A transmission disequilibrium test for general pedigrees that is robust to the presence of random genotyping errors and any number of untyped parents. Eur J Hum Genet EJHG 12:752–761. doi: 10.1038/sj.ejhg.5201219
  16. Kang HM, Sul JH, Service SK, et al. (2010) Variance component model to account for sample structure in genome-wide association studies. Nat Genet 42:348–354. doi: 10.1038/ng.548
  17. Kulminski AM, Loika Y, Culminskaya I, et al. (2016) Explicating heterogeneity of complex traits has strong potential for improving GWAS efficiency. Sci Rep 6:35390. doi: 10.1038/srep35390
  18. Kupper N, Ge D, Treiber FA, Snieder H (2006) Emergence of novel genetic effects on blood pressure and hemodynamics in adolescence: the Georgia Cardiovascular Twin Study. Hypertens Dallas Tex 1979 47:948–954. doi: 10.1161/01.HYP.0000217521.79447.9a
  19. Kupper N, Willemsen G, Riese H, et al. (2005) Heritability of daytime ambulatory blood pressure in an extended twin design. Hypertens Dallas Tex 1979 45:80–85. doi: 10.1161/01.HYP.0000149952.84391.54
  20. Lee JH, Cheng R, Graff-Radford N, et al. (2008) Analyses of the national institute on aging late-onset Alzheimer’s disease family study: implication of additional loci. Arch Neurol 65:1518–1526. doi: 10.1001/archneur.65.11.1518
  21. Lloyd-Jones LR, Robinson MR, Yang J, Visscher PM (2018) Transformation of summary statistics from linear mixed model association on all-or-none traits to odds ratio. Genetics 208:1397–1408. doi: 10.1534/genetics.117.300360
  22. Manichaikul A, Chen W-M, Williams K, et al. (2012) Analysis of family- and population-based samples in cohort genome-wide association studies. Hum Genet 131:275–287. doi: 10.1007/s00439-011-1071-0
  23. McArdle PF, O’Connell JR, Pollin TI, et al. (2007) Accounting for relatedness in family based genetic association studies. Hum Hered 64:234–242. doi: 10.1159/000103861
  24. Nazarian A, Gezan SA (2016) GenoMatrix: a software package for pedigree-based and genomic prediction analyses on complex traits. J Hered 107:372–379. doi: 10.1093/jhered/esw020
  25. Nazarian A, Yashin AI, Kulminski AM (2018) bioRxiv-Genome-wide analysis of genetic predisposition to Alzheimer’s disease and related sex-disparities. bioRxiv 321992. doi: 10.1101/321992
  26. Price AL, Zaitlen NA, Reich D, Patterson N (2010) New approaches to population stratification in genome-wide association studies. Nat Rev Genet 11:459–463. doi: 10.1038/nrg2813
  27. Purcell S, Neale B, Todd-Brown K, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81:559–575. doi: 10.1086/519795
  28. Qin H, Morris N, Kang SJ, et al. (2010) Interrogating local population structure for fine mapping in genome-wide association studies. Bioinformatics 26:2961–2968. doi: 10.1093/bioinformatics/btq560
  29. Shih PB, O’Connor DT (2008) Hereditary determinants of human hypertension. Hypertension 51:1456–1464. doi: 10.1161/HYPERTENSIONAHA.107.090480
  30. Spielman RS, McGinnis RE, Ewens WJ (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 52:506–516
  31. Splansky GL, Corey D, Yang Q, et al. (2007) The third generation cohort of the national heart, lung, and blood institute’s Framingham heart study: design, recruitment, and initial examination. Am J Epidemiol 165:1328–1335. doi: 10.1093/aje/kwm021
  32. Tang W, Hong Y, Province MA, et al. (2006) Familial clustering for features of the metabolic syndrome: the National Heart, Lung, and Blood Institute (NHLBI) Family Heart Study. Diabetes Care 29:631–636. doi: 10.2337/diacare.29.03.06.dc05-0679
  33. Vattikuti S, Guo J, Chow CC (2012) Heritability and genetic correlations explained by common SNPs for metabolic syndrome traits. PLoS Genet 8:e1002637. doi: 10.1371/journal.pgen.1002637
  34. Yang J, Lee SH, Goddard ME, Visscher PM (2011) GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet 88:76–82. doi: 10.1016/j.ajhg.2010.11.011
  35. Zhou X, Carbonetto P, Stephens M (2013) Polygenic modeling with bayesian sparse linear mixed models. PLoS Genet 9:e1003264. doi: 10.1371/journal.pgen.1003264
  36. Zondervan KT, Cardon LR (2007) Designing candidate gene and genome-wide case-control association studies. Nat Protoc 2:2492–2501. doi: 10.1038/nprot.2007.366
