Author manuscript; available in PMC: 2019 Dec 1.
Published in final edited form as: Stat Biosci. 2017 Jun 5;10(3):491–505. doi: 10.1007/s12561-017-9197-9

Robust Rare-Variant Association Tests For Quantitative Traits in General Pedigrees

Yunxuan Jiang 1, Karen N Conneely 2,*, Michael P Epstein 2,*
PMCID: PMC6329454  NIHMSID: NIHMS882163  PMID: 30643591

Abstract

Next generation sequencing technology has propelled the development of statistical methods to identify rare polygenetic variation associated with complex traits. The majority of these statistical methods are designed for case-control or population-based studies, with few methods that are applicable to family-based studies. Moreover, existing methods for family-based studies mainly focus on trios or nuclear families; there are far fewer existing methods available for analyzing larger pedigrees of arbitrary size and structure. To fill this gap, we propose a method for rare-variant analysis in large pedigree studies that can utilize information from all available relatives. Our approach is based on a kernel-machine regression (KMR) framework, which has the advantages of high power, as well as fast and easy calculation of p-values using the asymptotic distribution. Our method is also robust to population stratification due to integration of a QTDT framework (Abecasis, et al. 2000b) with the KMR framework. In our method, we first calculate the expected genotype (between-family component) of a non-founder using all founders’ information and then calculate the deviates (within-family component) of observed genotype from the expectation, where the deviates are robust to population stratification by design. The test statistic, which is constructed using within-family component, is thus robust to population stratification. We illustrate and evaluate our method using simulated data and sequence data from Genetic Analysis Workshop 18 (GAW18).

Keywords: rare variant, pedigree, quantitative trait, population stratification

1. Introduction

Next-generation sequencing (NGS) studies of complex human traits and diseases are becoming commonplace for investigating the role of rare polymorphic variation in such phenotypes. Many analytic methods have been developed for the analysis of such rare variants with a particular emphasis on techniques that first aggregate information on rare variants within a gene of interest and then contrast this aggregated genetic information with the phenotypic outcome. The majority of such aggregation-based methods (Kwee, Liu et al. 2008, Madsen and Browning 2009, Morris and Zeggini 2010, Zawistowski, Gopalakrishnan et al. 2010, Wu, Lee et al. 2011, Lee, Wu et al. 2012) focus on population-based designs or case-control designs. However, family-based study designs are gaining traction in NGS projects since they provide inherent benefits over the traditional population-based designs. In particular, families ascertained based on multiple relatives with a particular phenotype tend to enrich the sample for rare causal variants compared to a general population, thereby making such variants easier to detect (Zöllner 2012).

The appeal of family-based NGS studies has led to the development of a few analytic methods tailored for rare-variant analysis in such designs. Such methods (Chen, Meigs et al. 2013, Schaid, McDonnell et al. 2013, Jiang and McPeek 2014, Jiang, Conneely et al. 2014) generally apply a modeling framework that accounts for the relatedness of familial samples through appropriate modeling of kinship. However, such methods do not take into account the potential bias of findings due to population stratification. Population stratification is the presence of systematic differences between sub-populations both in the allele frequencies of the rare variants under study and in the distribution of the phenotype. Failure to model these differences will lead to an inflated false-positive rate and decreased power to detect real associations. For rare variants, the issue of population stratification is more severe than for common variants, as rare variants are more likely to be young mutations, which are more population specific (Gravel, Henn et al. 2011). It has been shown that inclusion of self-reported ethnicity as a covariate is not sufficient to adjust for population stratification (Serre, Montpetit et al. 2008). Similarly, standard methods that adjust for population stratification in common-variant analyses may not be as effective for rare variants. In particular, genomic control can lead to very conservative results for rare variants (Jiang, Epstein et al. 2013). Although principal components analysis works well for spatially distinctive populations, the procedure fails for spatially non-distinctive populations (Mathieson and McVean 2012).

With these concerns in mind, Jiang et al. (2014) developed a rare-variant association test for quantitative traits in parent-child trios and nuclear families that, by design, was robust to population stratification. The method was motivated by the QTDT framework (Abecasis, et al. 2000a), which showed that the observed genotype of a familial subject could be partitioned into orthogonal between-family and within-family components. The between-family component can be defined as the expected value of the subject’s genotype within the family and can be constructed as the average of the parents’ genotype or the average of the siblings’ genotype. The within-family component is the deviation of the observed genotype from the between-family component. While the between-family component is sensitive to population stratification, the within-family component is robust to stratification since it is based on a family-specific deviation. Utilizing a kernel-machine regression (KMR) framework for multi-marker analysis of familial quantitative phenotypes (Schifano, et al. 2012, Chen, et al. 2013), Jiang et al. (2014) created a robust rare-variant test by replacing observed sample genotypes in the standard KMR with their corresponding within-family genotypic components. Simulation results demonstrated the approach yielded appropriate type-I error even when strong confounding existed within the sample. As with other KMR approaches, the Jiang et al. (2014) approach derived p-values analytically using Davies’ (1980) method, thereby allowing easy application to large scale sequencing studies.

While the work of Jiang et al. (2014) provides a powerful approach that is robust in the presence of population stratification, the method’s design limited its application only to nuclear families and parent-child trios. However, many sequencing studies have emerged that utilize phenotype and genotype data collected on multiplex pedigrees that are larger and contain more distant relationships than those in nuclear families. Examples of such studies include the Epi4K study of epilepsy (Epi4K Consortium, 2012) and the Genetic Analysis Workshop (GAW18) study of blood pressure. Large pedigrees have unique features that make them ideal for mapping traits associated with rare variants. Compared to nuclear families or trios, rare variants are further enriched in large pedigrees (Wijsman 2012). It has been shown that large pedigree studies have increased power compared to smaller families with the same total number of samples, especially for rare-variant sequencing data (Wijsman and Amos 1997, Simpson, Justice et al. 2011, Wilson and Ziegler 2011). In addition to improved power, analysis of large pedigrees can provide evidence for both co-segregation and association, while population based studies can only provide evidence for association (Laird and Lange 2006, Wijsman 2012, Ott, Wang et al. 2015). Further, the study of large pedigrees provides a cost-effective strategy for rare-variant analysis as it enables in silico imputation of rare-variant genotypes in nonsequenced subjects using information from sequenced relatives coupled to knowledge of inheritance flow (Wijsman 2012, Cheung, Blue et al. 2014). With a large pedigree-based study design, researchers can also combine sequencing-based association studies with linkage analyses (Ott, Wang et al. 2015). Recent research has identified rare variants associated with several diseases or traits like hyperkalemic hypertension (Louis-Dit-Picard, Barc et al. 2012), spinocerebellar ataxias (Wang, Yang et al. 2010), hypolipidemia (Musunuru, Pirruccello et al. 2010), and lithium-responsive bipolar disorder (Cruceanu, Ambalavanan et al. 2013) by combining association and linkage approaches.

Given the obvious value of extended pedigrees, it would be useful to develop a robust family-based association test of rare variants for such designs that is also computationally efficient. While the method of Jiang et al. (2014) is both robust and fast, it is limited to trios and nuclear families and therefore cannot be applied to studies like GAW18 that possess sequence data for 20 Mexican American families with an average pedigree size of 70 (see sample pedigree in supplementary Figure S1). Therefore, in this paper, we propose an expansion of the Jiang et al. (2014) framework to allow robust and efficient analysis of multiplex families of arbitrary size and structure. To do so, we employ a non-trivial modification of the QTDT framework for use in extended pedigrees developed by Abecasis et al. (2000b) that uses information from all genotyped family members to construct a more informative between-family genotypic component. We then derive the within-family component for each genotype and integrate this information within the KMR framework of Schifano et al. (2012) to obtain a rare-variant test that is robust to population stratification. In the following sections, we first introduce our study setting and then describe how we use the QTDT framework to decompose genotype information to obtain a robust within-family component. We then show how to integrate this information within a KMR framework to yield our robust test. We also describe how we can improve the power of our robust test by pre-screening potential trait-influencing genes using genotype and phenotypic information from founders across families. Such founder information is orthogonal to the within-family information used in our proposed test. We then evaluate our method using both simulation studies and sequencing data from a study of systolic and diastolic blood pressure (SBP and DBP) provided by the Genetic Analysis Workshop 18 (GAW18).

2. Materials and Methods

2.1 Study Design and Notation

We assume a family-based study consisting of N families, where each family consists of a large pedigree. While we use Figure 1 as an example here to show the structure of a large pedigree, our method can be applied to any family structure and can accommodate any family size, unlike the original framework of Jiang et al. (2014). Suppose there are s rare variants in a gene of interest, and let Gij, an s × 1 vector, represent the genotypes of the s rare variants for the jth (j=1,2,…,ni) individual in the ith (i=1,2,…,N) family. We assume an additive model, and let components in Gij take the value of 0, 1, or 2, indicating the number of copies of the minor allele at each site. If an individual is not genotyped, then we leave Gij undefined. Let Xij, a c × 1 vector, denote the covariates, and denote Yij as the value of the quantitative outcome for the jth individual in the ith family. For non-founders (defined as individuals with ancestors included in the pedigree, e.g. individuals 5,6,7,8,9,10 in Figure 1), let Mij and Fij be the indices of the mother and father of the jth individual in the ith family, respectively. For founders (defined as individuals with no ancestors in the pedigree, e.g. individuals 1,2,3,4 in Figure 1), we leave Mij and Fij undefined.

Figure 1. Example of Pedigree Structure

2.2 KMR Framework for Pedigree Data

We create our robust rare-variant association test for a quantitative trait based on the KMR test of Schifano et al. (2012) and Chen et al. (2013) for association testing of a group of genetic variants with a continuous phenotype allowing for related individuals. As shown by these authors, the KMR test can be implemented in a linear mixed-modeling framework with mean and variance defined through the model:

$$Y_{ij} = X_{ij}^{T}\alpha + h(G_{ij}) + f_{ij} + \varepsilon_{ij} \qquad (1)$$

where α is a c×1 vector of coefficients for Xij, fij is the random effect to account for within-family correlation, and εij is the random error term. We further assume that the random effects within a family, $f_i = (f_{i1}, f_{i2}, f_{i3}, \ldots, f_{in_i})^{T}$, follow a multivariate normal distribution $f_i \sim \mathrm{MVN}(0, 2\Phi_i\sigma_{pg}^{2})$. Here Φi is the kinship matrix for the ith family (elements in Φi represent the pairwise kinship coefficients between relatives in the ith family) and $\sigma_{pg}^{2}$ represents the variance due to the shared polygenic effect. We also assume that the random environmental effect εij is independent among subjects within and between families and follows a normal distribution with mean 0 and variance $\sigma_{e}^{2}$.

Within equation (1) above, h(Gij) is a function of Gij defined through a positive semi-definite kernel function k(·,·). Following Liu et al. (Liu, Lin et al. 2007) and Kwee et al. (Kwee, Liu et al. 2008), h(Gij) can be represented as $h(G_{ij}) = \sum_{i'}\sum_{j'} \vartheta_{i'j'}\, k(G_{ij}, G_{i'j'})$, where $\vartheta_{i'j'}$ are unknown parameters. It is worth noting that the kernel function, $k(G_{ij}, G_{i'j'})$, measures the genetic similarity between subject j in family i and subject j′ in family i′; the KMR test then contrasts this similarity with the phenotypic similarity between the two subjects. It has been shown that an appropriate choice of kernel can increase power (Wu, Lee et al. 2011). Frequently used kernels include the identity-by-state (IBS) kernel and the linear kernel. The IBS kernel, which takes the form $k(G_{ij}, G_{i'j'}) = \sum_{l=1}^{s} \left(2 - |G_{ijl} - G_{i'j'l}|\right)$, measures genetic similarity as the number of alleles shared identical by state. It assumes a nonlinear effect of each rare variant and can thus enable the study of epistatic effects. The linear kernel, on the other hand, assumes a linear relationship between the trait and the variants. This kernel takes the form $k(G_{ij}, G_{i'j'}) = \sum_{l=1}^{s} G_{ijl}G_{i'j'l}$. Additionally, we can include prior knowledge of variants that are possibly causal in the gene by assigning each variant a weight. If prior knowledge is not available, weights can also be calculated as a function of minor allele frequency (under the logic that the rarer the allele, the more likely it is selected against and therefore the more likely it is to be pathogenic). Wu et al. (2011) suggest calculating the weights based on a beta distribution, which assigns greater weight to less frequent variants. For a given set of weights, we can create weighted kernels such as the weighted linear kernel $k(G_{ij}, G_{i'j'}) = \sum_{l=1}^{s} w_l\, G_{ijl}G_{i'j'l}$, where $w_l$ denotes a normalized weight for variant l in the gene.
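The kernels described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the function names (`ibs_kernel`, `linear_kernel`, `kernel_matrix`) and the toy genotypes are hypothetical, and genotypes are assumed to be coded as minor-allele counts.

```python
from typing import Callable, List, Optional, Sequence


def ibs_kernel(g1: Sequence[float], g2: Sequence[float]) -> float:
    """IBS kernel: sum over variants of 2 - |g1_l - g2_l|."""
    return float(sum(2 - abs(a - b) for a, b in zip(g1, g2)))


def linear_kernel(g1: Sequence[float], g2: Sequence[float],
                  weights: Optional[Sequence[float]] = None) -> float:
    """Linear kernel sum_l w_l * g1_l * g2_l; unweighted when weights is None."""
    if weights is None:
        weights = [1.0] * len(g1)
    return float(sum(w * a * b for w, a, b in zip(weights, g1, g2)))


def kernel_matrix(genotypes: List[Sequence[float]],
                  kernel: Callable[[Sequence[float], Sequence[float]], float]
                  ) -> List[List[float]]:
    """Similarity matrix K whose entries are the kernel for each pair of subjects."""
    return [[kernel(gi, gj) for gj in genotypes] for gi in genotypes]


# Toy data (hypothetical): three subjects, s = 3 rare variants.
genos = [[0, 1, 0], [0, 1, 2], [2, 0, 0]]
K_ibs = kernel_matrix(genos, ibs_kernel)
K_lin = kernel_matrix(genos, linear_kernel)
K_wlin = kernel_matrix(genos, lambda a, b: linear_kernel(a, b, [0.5, 0.3, 0.2]))
```

Because `linear_kernel` only multiplies and sums, it accepts real-valued inputs as well as integer allele counts, so the same function could be applied to transformed genotype vectors.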

It can be easily shown that the estimator of h takes the same form as in the linear mixed model with h as a random effect (Liu, Lin et al. 2007, Schifano, Epstein et al. 2012):

$$y = X\alpha + h + f + e, \qquad (2)$$

where α is a c×1 vector of coefficients for the fixed effects X; h is a $\left(\sum_{i=1}^{N} n_i\right) \times 1$ vector of random effects that follows an arbitrary distribution with mean 0 and variance τK, where K is the genetic similarity matrix with element 〈ij, i′j′〉 equal to $k(G_{ij}, G_{i'j'})$; and $f = (f_1^{T}, f_2^{T}, \ldots, f_N^{T})^{T} \sim N(0, 2\sigma_{pg}^{2}\Phi)$, where Φ is a block diagonal matrix with Φi on the diagonal. Finally, $e = (e_1^{T}, e_2^{T}, \ldots, e_N^{T})^{T} \sim N(0, \sigma_{e}^{2}I)$. Thus, the test of whether genotype is associated with the outcome is equivalent to testing whether the random component h equals 0 or not. We adopted the variance component score test, which is the locally most powerful test (Lin 1997). As h has variance τK, the test of whether h=0 is equivalent to testing whether τ = 0. The null hypothesis is H0: τ = 0, and the test statistic takes the form:

$$Q = \frac{1}{2}\,(Y - X\hat{\alpha}_0)^{T}\,\hat{V}_0^{-1} K \hat{V}_0^{-1}\,(Y - X\hat{\alpha}_0), \qquad (3)$$

where all parameters are estimated under the null hypothesis. $\hat{V}_0 = 2\hat{\sigma}_{pg}^{2}\Phi + \hat{\sigma}_{e}^{2}I$ denotes the sample variance/covariance matrix estimated under the null. To obtain the null distribution of Q, we define a projection matrix $P = \hat{V}_0^{-1} - \hat{V}_0^{-1}X(X^{T}\hat{V}_0^{-1}X)^{-1}X^{T}\hat{V}_0^{-1}$, such that $P\hat{V}_0P = P$. Thus, under the null, we have

$$Q = \frac{1}{2}\,Y^{T}PKPY = \sum_{i=1}^{N} \lambda_i \chi_{1i}^{2}, \qquad (4)$$

where λi are the eigenvalues of $\frac{1}{2}D\hat{V}_0^{-1/2}K\hat{V}_0^{-1/2}D$, with $D = I - \hat{V}_0^{-1/2}X(X^{T}\hat{V}_0^{-1}X)^{-1}X^{T}\hat{V}_0^{-1/2}$. As the $\chi_{1i}^{2}$ are independent and identically distributed chi-square random variables with 1 degree of freedom, Q asymptotically follows a mixture of chi-square distributions, and the p-values can be calculated using the Davies method (Davies 1980).
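The step from the residual form of Q in (3) to the quadratic form in (4) can be sketched as follows. The sketch assumes that the null estimate of α is the generalized least squares estimator and that the estimated covariance matrix is symmetric; both are assumptions made here for illustration rather than statements taken from the text.

```latex
\hat{\alpha}_0 = (X^{T}\hat{V}_0^{-1}X)^{-1}X^{T}\hat{V}_0^{-1}Y
\;\Longrightarrow\;
\hat{V}_0^{-1}(Y - X\hat{\alpha}_0)
  = \left[\hat{V}_0^{-1} - \hat{V}_0^{-1}X(X^{T}\hat{V}_0^{-1}X)^{-1}X^{T}\hat{V}_0^{-1}\right]Y
  = PY,
\qquad\text{so}\qquad
Q = \tfrac{1}{2}\,(Y - X\hat{\alpha}_0)^{T}\hat{V}_0^{-1}K\hat{V}_0^{-1}(Y - X\hat{\alpha}_0)
  = \tfrac{1}{2}\,Y^{T}PKPY .
```

The same algebra gives $P = \hat{V}_0^{-1/2}D\hat{V}_0^{-1/2}$, which is why the eigenvalues are taken from a matrix built from D.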

2.3 QTDT Framework for General Pedigrees

In the presence of population stratification, association testing of Gij with Yij in models (1) and (2) may lead to spurious association due to the underlying differences in allele frequencies of the sub-populations. However, for family studies, family members can be used as internal controls, where an expected genotype can be constructed using the family members’ information. Tests based on the within-family component (deviation of observed genotype from expected within family) will not be influenced by population structure, even in the most extreme case, where each of the N pedigrees is drawn from a different population. Here, we leverage the work of Abecasis et al. (Abecasis, Cookson et al. 2000) and present the method to calculate transmission scores for individuals in general pedigrees.

The QTDT framework (Abecasis, Cardon et al. 2000) for general pedigrees decomposes a genotype into a between-family component (which is sensitive to population stratification) and a within-family component (which is robust to population stratification). For relative j in family i, let Bij and Wij denote vectors of between-family and within-family genotype components for the s rare-variant genotypes in Gij. Assuming all parents in the pedigree are genotyped, the between-family component for founders (with no ancestors included in the pedigree) will be equal to their observed genotypes, while the between-family component for non-founders at each rare-variant genotype is equal to the average of the between-family components of that individual’s parents, such that $B_{ij} = (B_{M_{ij}} + B_{F_{ij}})/2$. Using the pedigree in Figure 1 as an example, suppose all the individuals in the pedigree are genotyped. Suppressing the family index for ease of presentation, the between-family components for founders 1, 2, 3, 4 are $B_1 = G_1$, $B_2 = G_2$, $B_3 = G_3$, $B_4 = G_4$, respectively. For the non-founders in the second generation, the between-family component for individual 5 is $B_5 = (B_1 + B_2)/2$, and the between-family component for individual 6 is $B_6 = (B_3 + B_4)/2$. For the non-founders in the third generation, the between-family components for individuals 7, 8, 9, and 10 are $(B_5 + B_6)/2 = (B_1 + B_2 + B_3 + B_4)/4$. It can be seen that, in the situation where all founders are genotyped, the between-family component of any non-founder is calculated as:

$$B_{ij} = \sum_{f \in F} 2\varphi_{ijf}\,G_{if}, \qquad (5)$$

where in the ith family, f is the index of founders, Gif is the rare-variant genotype vector of the founder, φijf is the kinship coefficient between individual j and founder f, and F is the set of all the genotyped founders.
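The equivalence between the parent-averaging recursion and the kinship-weighted sum in equation (5) can be checked on the example pedigree. The sketch below is illustrative only: the pedigree structure is inferred from the text's description of Figure 1 (founders 1 to 4, individual 5 the child of 1 and 2, individual 6 the child of 3 and 4, individuals 7 to 10 the children of 5 and 6), the founder genotypes are made up, and the function names are hypothetical. Founders are assumed to be unrelated to one another, and ids are assumed to be ordered so that parents precede their children.

```python
from fractions import Fraction
from functools import lru_cache

# Pedigree inferred from the description of Figure 1 (an assumption).
# None marks a founder; otherwise the value is the pair of parent ids.
PARENTS = {1: None, 2: None, 3: None, 4: None,
           5: (1, 2), 6: (3, 4),
           7: (5, 6), 8: (5, 6), 9: (5, 6), 10: (5, 6)}


@lru_cache(maxsize=None)
def kinship(a: int, b: int) -> Fraction:
    """Kinship coefficient by the standard recursion on the younger individual."""
    if a == b:
        p = PARENTS[a]
        inbreeding = kinship(p[0], p[1]) if p else Fraction(0)
        return (1 + inbreeding) / 2
    if a < b:                 # make `a` the larger id, so it is never an ancestor of `b`
        a, b = b, a
    p = PARENTS[a]
    if p is None:             # distinct founders are assumed unrelated
        return Fraction(0)
    return (kinship(p[0], b) + kinship(p[1], b)) / 2


def between_recursive(j, G):
    """B_j by averaging the parents' between-family components."""
    p = PARENTS[j]
    if p is None:
        return [Fraction(g) for g in G[j]]
    first, second = between_recursive(p[0], G), between_recursive(p[1], G)
    return [(x + y) / 2 for x, y in zip(first, second)]


def between_kinship(j, G):
    """B_j as the sum over founders of 2 * kinship(j, f) * G_f, as in equation (5)."""
    founders = [f for f, p in PARENTS.items() if p is None]
    n_variants = len(G[founders[0]])
    return [sum(2 * kinship(j, f) * G[f][l] for f in founders)
            for l in range(n_variants)]


# Made-up founder genotypes at two rare variants.
G = {1: [1, 0], 2: [0, 0], 3: [0, 2], 4: [0, 0]}
B7_recursive = between_recursive(7, G)
B7_kinship = between_kinship(7, G)
```

For a grandchild and a grandparent the kinship coefficient is 1/8, so each founder contributes a quarter of its genotype, matching the $(B_1 + B_2 + B_3 + B_4)/4$ expression for individuals 7 to 10.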

In the situation where the parents’ genotypes are missing, the between-family component Bij is equal to the average of the genotypes of all siblings of relative j. For example, in Figure 1, if individuals 5 and 6 are not genotyped, then the between-family component for individuals 7, 8, 9, and 10 is $(G_7 + G_8 + G_9 + G_{10})/4$. The average of the genotypes of siblings in the family is the sufficient statistic for the between-family component (Abecasis, Cardon et al. 2000). We note that, when applied to parent-child trios and nuclear families, the method for calculating the between-family component that we describe here is equivalent to the forms of the between-family component outlined in the work of Jiang et al. (2014).

The within-family genotype vector for the s rare-variant genotypes Wij is then calculated as the difference between the observed genotype vector and the between-family genotype vector:

$$W_{ij} = G_{ij} - B_{ij} \qquad (6)$$

Positive values within Wij indicate excess transmission of the minor (reference) allele, while negative values of Wij indicate excess transmission of the major allele. As discussed above, the within-family component is not influenced by population substructure; thus, the test on the within-family component is robust to population stratification.
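The reason the within-family component is insensitive to allele-frequency differences between populations can be sketched in one line. The sketch assumes Mendelian transmission (each parent passes on either of its two alleles with probability 1/2) and that all parents are genotyped, so that iterating the parental average up the pedigree recovers the between-family component.

```latex
E\left[G_{ij} \mid G_{M_{ij}}, G_{F_{ij}}\right] = \frac{G_{M_{ij}} + G_{F_{ij}}}{2}
\;\Longrightarrow\;
E\left[W_{ij} \mid \{G_{if}\}_{f \in F}\right]
  = E\left[G_{ij} \mid \{G_{if}\}_{f \in F}\right] - B_{ij} = 0 .
```

Because this conditional mean is zero whatever the founders' genotypes are, it does not depend on the allele frequencies of the population from which the founders were drawn.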

As discussed before, directly testing based on the observed rare-variant genotypes in models (1) and (2) will lead to spurious association in the presence of population stratification. For our robust test, we follow the same approach as in our earlier work (Jiang et al., 2014) and simply calculate Wij as described above, replace Gij with Wij in equations (1) and (2), and construct our score statistic Q in (3) using Wij.

2.4 Screening Methods

Although the within-family component has the advantage of robustness to population stratification, constructing tests based only on the within-family genotypic component while ignoring the between-family component reduces power. However, if founders’ phenotype and genotype data are available, we can borrow the idea of Purcell et al. (Purcell, Sham et al. 2005) to implement a screening procedure to potentially increase power. Specifically, we use the founders’ phenotype and genotype information in the first stage to identify those regions showing the strongest signals of association. We can perform such testing using standard burden or variance-component tests for unrelated subjects. We then implement a second stage where we test only the top regions from the first stage using our proposed test in (3) based on the within-family genotypic component. The number of top regions in the second stage can take a value between 1 and the total number of regions. In this project, we assume 10%–50% of the regions enter the second stage. By pre-screening in this manner, we reduce the multiple-testing burden for our robust test, thereby increasing power. As the within-family component and the between-family component are orthogonal to each other by design (Abecasis, Cookson et al. 2000), population stratification that can invalidate the first-stage analysis using founders will not invalidate the within-family component test.
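The two-stage logic can be sketched as below. This is a hedged illustration rather than the authors' code: the function names and p-values are hypothetical, and applying a Bonferroni correction over only the regions that pass the first stage is an assumption about how the reduced multiple-testing burden is realized.

```python
from typing import Dict, List


def select_top_regions(stage1_pvalues: Dict[str, float], fraction: float) -> List[str]:
    """Stage 1: keep the regions with the smallest founder-based p-values."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, round(fraction * len(stage1_pvalues)))
    ranked = sorted(stage1_pvalues, key=stage1_pvalues.get)
    return ranked[:n_keep]


def second_stage_hits(stage2_pvalues: Dict[str, float], selected: List[str],
                      alpha: float = 0.05) -> List[str]:
    """Stage 2: test only the selected regions with the within-family statistic.

    Assumption: the significance level is split over the screened regions only.
    """
    threshold = alpha / len(selected)
    return [region for region in selected if stage2_pvalues[region] < threshold]


# Hypothetical first-stage p-values for 10 regions (founders only).
stage1 = {f"region{i}": p for i, p in enumerate(
    [0.40, 0.01, 0.20, 0.03, 0.75, 0.55, 0.90, 0.12, 0.66, 0.31], start=1)}
selected = select_top_regions(stage1, fraction=0.30)

# Hypothetical second-stage p-values from the within-family test.
stage2 = {"region2": 0.008, "region4": 0.30, "region8": 0.02}
hits = second_stage_hits(stage2, selected)
```

In this toy example the second-stage p-value of 0.008 falls below the screened threshold of 0.05/3 but would not fall below 0.05/10, the threshold if all 10 regions were tested.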

2.5 Simulation Studies

We evaluate the type I error rate and power of our method using simulated sequencing data generated by cosi (Schaffner, Foo et al. 2005), which closely resembles empirical data. To simulate large pedigrees, we first use cosi to simulate 5000 haplotypes of European ancestry and 5000 haplotypes of African ancestry. We then randomly draw and pair haplotypes within each population and randomly select one haplotype from each parent to pass down to offspring. Our simulated pedigree has the same structure as Figure 1. We assume that there are 10 non-overlapping genes or regions of interest, each 30kb long. We show the empirical distribution of rare variants in these regions across simulated datasets in Supplementary Figure S2. For each family, we simulate phenotype data from a multivariate normal distribution, whose mean and variance vary according to different scenarios.

For type I error rate simulations, all 10 regions are null, while for power simulations we randomly select one region of the 10 to harbor causal variation. Rare variants are defined as variants with minor allele frequency (MAF) smaller than 3%. To simulate population substructure, we simulate the outcome for the null model as: Yij = γ IAfrican,ij + fij + eij, where γ is the mean trait difference between European and African, and IAfrican,ij is the indicator variable, which is 1 for African individuals and 0 for European individuals. For the power simulations, we let either 5% or 15% of the rare variants in the causal region influence phenotype. Within each family, we simulate the random effects fij through fi~MVN(0,0.56 × 2Φi). eij is the random error and follows a standard normal distribution. For each causal variant, we define the effect size as β = c×|log10 MAF|, where c is a pre-defined constant. Thus, the outcome is simulated as Yij = γIAfrican,ij + βij×Gij + fij + eij. We perform 5000 simulations to evaluate type I error rate. For power simulation, we also perform 5000 simulations and calculate power as the proportion of simulations with the causal region correctly identified. Unless otherwise noted, we applied a linear genotype kernel for analysis.
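The phenotype model described above can be sketched for a single family as follows. This is an illustrative sketch under several assumptions, not the simulation code used in the paper: the function names are hypothetical, the genetic term is interpreted as the sum over causal variants of effect size times genotype, the family is a simple parent-parent-child trio rather than the Figure 1 pedigree, and the correlated polygenic effect is drawn with a hand-written Cholesky factorization.

```python
import math
import random


def effect_size(maf: float, c: float) -> float:
    """beta = c * |log10(MAF)|, so rarer variants receive larger effects."""
    return c * abs(math.log10(maf))


def cholesky(a):
    """Lower-triangular factor `low` with low * low^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def simulate_family(kinship, genotypes, betas, is_african, gamma, rng, sigma2_pg=0.56):
    """Draw Y_j = gamma * I_African + sum_l beta_l * G_jl + f_j + e_j for one family."""
    n = len(kinship)
    # Polygenic covariance 2 * Phi * sigma2_pg, as in f ~ MVN(0, 0.56 x 2 Phi).
    cov = [[2.0 * sigma2_pg * kinship[r][c] for c in range(n)] for r in range(n)]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    f = [sum(low[r][k] * z[k] for k in range(r + 1)) for r in range(n)]
    y = []
    for j in range(n):
        genetic = sum(b * g for b, g in zip(betas, genotypes[j]))
        e = rng.gauss(0.0, 1.0)  # standard normal error term
        y.append(gamma * int(is_african) + genetic + f[j] + e)
    return y


# Kinship matrix of an outbred parent-parent-child trio (rows: parent, parent, child).
TRIO_KINSHIP = [[0.5, 0.0, 0.25], [0.0, 0.5, 0.25], [0.25, 0.25, 0.5]]
genotypes = [[1, 0], [0, 0], [1, 0]]          # two rare variants per person
betas = [effect_size(0.01, c=0.5), 0.0]       # first variant causal, second null
y = simulate_family(TRIO_KINSHIP, genotypes, betas, is_african=True,
                    gamma=0.25, rng=random.Random(1))
```

Setting every entry of `betas` to zero gives the null model used for the type I error simulations.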

2.6 GAW18 Data

The Genetic Analysis Workshop 18 (GAW18) provides whole genome sequence data for extended pedigrees and phenotypes such as systolic blood pressure (SBP) and diastolic blood pressure (DBP). The dataset was drawn from the T2D-GENES Consortium Project 2, a family-based study that aims to identify low-frequency variants that increase the risk of type-2 diabetes. The original dataset contains whole genome sequences for the odd-numbered chromosomes only (chromosomes 1, 3, 5,…,21) for 464 individuals from 20 Mexican American families. The dataset we used in this project contains 959 individuals: 464 were directly sequenced by Complete Genomics Inc., while the remaining 495 had sequence data imputed from array-based genotype data by the T2D-GENES Consortium. In addition to SBP and DBP, the dataset also includes information on age, gender, current use of antihypertensive medicine, and current smoking status. We include these phenotypes as covariates in our model. Detailed information about the dataset can be found in Almasy et al. (Almasy, Dyer et al. 2014).

After a standard data-cleaning procedure that removed subjects with missing SBP or DBP measurements, our final dataset contained 855 individuals. Genes were annotated using information from the 1000 Genomes Project (http://www.1000genomes.org/). We tested all genes in the 11 odd-numbered chromosomes, where each gene was tested individually. For each gene, we calculated the empirical frequency of the variants within the gene and only performed tests on the rare variants, where a rare variant was defined as having a minor-allele frequency (MAF) less than 3%. For perspective, we show the empirical distribution of rare variants within genes in the GAW18 project in Supplementary Figure S3. We constructed the test statistics using within-family components as defined above.
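The rare-variant filter can be sketched as follows. This is a minimal illustration with hypothetical function names and made-up genotypes: it counts alleles naively across all genotyped subjects (ignoring relatedness), treats `None` as a missing genotype, and excludes monomorphic sites, all of which are assumptions not specified in the text.

```python
from typing import List, Optional


def minor_allele_frequency(column: List[Optional[int]]) -> Optional[float]:
    """Empirical MAF from allele counts (0/1/2), skipping missing genotypes."""
    observed = [g for g in column if g is not None]
    if not observed:
        return None
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1 - freq)


def rare_variant_indices(genotype_rows: List[List[Optional[int]]],
                         threshold: float = 0.03) -> List[int]:
    """Indices of variants with 0 < MAF < threshold (3% by default)."""
    n_variants = len(genotype_rows[0])
    keep = []
    for l in range(n_variants):
        maf = minor_allele_frequency([row[l] for row in genotype_rows])
        if maf is not None and 0 < maf < threshold:
            keep.append(l)
    return keep


# Made-up data: 20 subjects, 3 variants.
# Variant 0: one heterozygote (MAF 1/40); variant 1: ten heterozygotes; variant 2: monomorphic.
rows = [[1 if i == 0 else 0, 1 if i < 10 else 0, 0] for i in range(20)]
rare = rare_variant_indices(rows)
```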

3. Results

3.1 Type I Error

We first performed null simulations to show that population stratification can lead to an inflated type I error rate for sequencing studies of large pedigrees. Figure 2 summarizes the empirical type I error rates of a study with 25 European pedigrees and 75 African pedigrees, each with the same size and family structure as shown in Figure 1. We first set the mean trait difference (γ) between European and African to be 1 (Figure 2 Left) and further increased it to 2 (Figure 2 Right). Both figures show that in the presence of population stratification, test statistics constructed on observed genotypes have inflated type I error rates (yellow bars in Figure 2). As population structure becomes more extreme, the inflation becomes more severe (Figure 2 Right). We then performed tests using our robust test statistic together with our two-stage screening procedure, which uses founders’ genotypes and phenotypes. Figure 2 shows that testing on the within-family component combined with the screening method leads to appropriate control of the type I error rate in the presence of population stratification.

Figure 2. Type 1 Error Rates. Left: Mean trait difference between European and African is 1. Right: Mean trait difference between European and African is 2. 10 30-kb regions are simulated. Yellow bar: Type 1 error rate tested on observed genotype. Others: Type 1 error rate tested on within-family component, with different number of genes at second stage. Black line: y=0.05

3.2 Power

We next examined the power of the proposed robust test. For power simulations, we assume the mean trait difference between European and African is 0.25. For each simulation, we randomly drew 25 European pedigrees and 75 African pedigrees from the haplotype pools. We varied the percentage of rare causal variants in the causal region from 5% (Figure 3a) to 15% (Figure 3b). We also assumed different effect sizes (β = c×|log10 MAF|) for the causal variants by letting c take the values 0.4, 0.5, and 0.6. Figure 3 shows that power increases as the percentage of causal variants in a region increases and as the effect size increases. We next investigated whether the two-stage screening approach using founder information improves power over a within-family analysis that ignores screening. As shown in Figure 3, screening on the top 10%–50% of hits can yield noticeable improvements in power over the naïve strategy. In addition to applying the linear genotype kernel, we also considered a weighted linear genotype kernel for screening and analysis (with weights based on minor-allele frequencies using the weight function of Wu et al. (2011)). The results, shown in Supplementary Figure S4, are similar to those for the linear genotype kernel. With screening, we observed a slight improvement of the weighted linear kernel over the unweighted linear kernel, particularly when larger effect sizes were assumed.

Figure 3. Power to detect rare-variant association in large pedigrees. Figure 3a: 5% of rare variants in the causal region are causal variants. Figure 3b: 15% of rare variants in the causal region are causal variants. Yellow bars: Power without screening. Others: Power with screening. Mean trait difference between European and African is 0.25. 10–50% of regions entered the second stage.

3.3 Application to GAW18 Dataset

We used GAW18 data to test for association between DBP/SBP and genes on odd chromosomes. Within each gene, we calculated empirical frequencies of variants and only tested on variants with frequencies smaller than 3%. GAW18 provides longitudinal phenotype information, where SBP and DBP were measured in up to four follow-ups for each subject. We used the baseline measurement to test for association. We also controlled for age, gender, current usage of anti-hypertensive medicine, and current smoking status in our model. The pedigrees are relatively large in the dataset. The median number of individuals in a pedigree is 37 (min 22, max 74). Among the participants, 20.2% of them smoke, 9.4% took medicine, and 57.7% of them are female.

We performed association tests using our robust test. The genome-wide significance level with Bonferroni correction is: αBonferroni = 0.05/7034 = 7.1×10−6. We chose the weighted linear kernel and used the Davies method to calculate p-values. Following Wu et al. (2011), the weight is calculated as wj = Beta(MAFj; 1, 25). The results of testing SBP and DBP are summarized in Figure 4. As shown in Figure 4, we did not observe any genes passing the genome-wide significance level (7.1×10−6, based on Bonferroni adjustment for 7034 genes). At the suggestive level (1×10−4), one gene on chromosome 21 is associated with SBP, and one gene on chromosome 7 is associated with DBP. The gene associated with SBP is chromosome 21 open reading frame 33 (C21orf33), which is a protein-coding gene and is over-expressed in Down Syndrome (Yahya-Graison, et al. 2007). LSM5 is associated with DBP at the suggestive level. It has been found that the human LSM1 to LSM7 genes are expressed in HeLa cells within cytoplasmic foci (Ingelfinger et al., 2002), which contain important factors in the degradation of mRNA. In addition to the Manhattan plots shown in Figure 4, we also constructed QQ plots of results using both the observed genotypes and the within-family components of the genotypes. We present these QQ plots in Supplementary Figure S5, which show inflation for SBP (but not DBP) when analyzing observed genotypes. We observed no such inflation when analyzing the within-family component, although results for SBP showed some deflation in p-values.
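The Bonferroni threshold and the beta-based weights can be reproduced with a short sketch. The function names are hypothetical. The sketch takes the weight to be the Beta(1, 25) density evaluated at the MAF and rescales the weights to sum to one; whether the density or a transformation of it is used, and how the weights are normalized, are assumptions made here for illustration.

```python
import math
from typing import List


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests


def beta_density(x: float, a: float, b: float) -> float:
    """Density of the Beta(a, b) distribution at x, for 0 <= x <= 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


def normalized_beta_weights(mafs: List[float], a: float = 1.0, b: float = 25.0) -> List[float]:
    """Beta-density weights rescaled to sum to one (the rescaling is an assumption)."""
    raw = [beta_density(m, a, b) for m in mafs]
    total = sum(raw)
    return [w / total for w in raw]


threshold = bonferroni_threshold(0.05, 7034)       # about 7.1e-6 for 7034 genes
weights = normalized_beta_weights([0.001, 0.01, 0.029])
```

With shape parameters 1 and 25 the density decreases monotonically in the MAF, so rarer variants receive larger weights.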

Figure 4.

Manhattan plots for GAW18 analyses. Figure 4a: Association analyses between SBP and the within-family component of genotypes within genes on odd-numbered chromosomes. Figure 4b: Association analyses between DBP and the within-family component of genotypes within genes on odd-numbered chromosomes. Red line: genome-wide significance level (p<7.1×10−6). Blue line: suggestive significance level (p<1×10−4).

4. Discussion

In this paper, we presented a framework for rare-variant sequencing studies in large pedigrees. Large pedigrees have several important features that make them ideal for finding trait-associated rare variants. Our previous work on robust and efficient family-based analysis (Jiang et al. 2014) was applicable only to parent-case trios or nuclear families, so in this work we extend it to handle large pedigrees of arbitrary size and structure, such as those in the GAW18 study of blood pressure. Our model, which combines a kernel machine framework for rare-variant analysis with a QTDT framework for general pedigrees, provides a powerful, efficient, and robust way to identify such associations in large pedigree studies. Because the score test statistic asymptotically follows a mixture of chi-square distributions, p-values are much easier to calculate than with other methods. This feature also makes our model applicable to large-scale genetic studies. We also applied our method to the GAW18 data to identify rare variants associated with SBP/DBP, testing all genes on odd-numbered chromosomes. This application shows that our method can be easily applied to large-scale data: the analysis of a gene takes 70 seconds on a Linux system with 768 processors and 512 GB of RAM. The data from GAW18 are based on 20 extended Mexican-American families. For studies that do not have records of participants’ geographic origin, or whose participants are from different origins, our method provides a robust way to perform the test.
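To make the mixture-of-chi-squares null distribution concrete, the sketch below estimates a p-value for a statistic Q whose null distribution is a weighted sum of independent 1-df chi-square variables. It is not the Davies method used in our analyses: it substitutes plain Monte Carlo simulation, which is simpler to show but cannot resolve p-values near the genome-wide threshold without a very large number of draws. The function name and the example mixture weights are hypothetical.

```python
import random
from typing import Sequence


def mixture_chisq_pvalue_mc(
    q_obs: float,
    mixture_weights: Sequence[float],
    n_draws: int = 100_000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of P(sum_i lambda_i * Z_i^2 >= q_obs), Z_i ~ N(0, 1).

    mixture_weights holds the lambda_i (in kernel tests, eigenvalues of a
    matrix built from the kernel and the null model). This is a stand-in
    for an exact method such as Davies (1980), not a replacement for it.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        q_null = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in mixture_weights)
        if q_null >= q_obs:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_draws + 1)


if __name__ == "__main__":
    # Hypothetical mixture weights for illustration only.
    print(mixture_chisq_pvalue_mc(6.0, [1.0, 0.5, 0.25]))
```

With a single weight of 1 the mixture reduces to an ordinary 1-df chi-square distribution, which gives a simple sanity check against the familiar 5% critical value of about 3.84.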

In this project, we assumed that rare variants are associated with only a single phenotype. However, there is substantial interest in identifying genetic factors with pleiotropic effects that influence multiple distinct phenotypes. Current methods for family data are not well equipped to investigate the effect of pleiotropy. For example, with the GAW18 data, analyses seeking to identify genes simultaneously associated with both SBP and DBP cannot be performed. However, Broadaway et al. (2016) provide a framework that can test cross-phenotype effects of rare variants. Their method is based on kernel distance-covariance, and its test statistic also asymptotically follows a mixture of chi-square distributions. In contrast to the method presented here, Broadaway et al. focused only on unrelated individuals. In the future, we would like to combine our robust test with the method of Broadaway et al. (2016) to test cross-phenotype effects of rare variants in related individuals.

Supplementary Material

12561_2017_9197_MOESM1_ESM

Figure S1. Sample pedigree in GAW 18 data

Figure S2. Histogram of number of rare variants across simulations

Figure S3. Histogram of number of rare variants in each gene on chromosome 1 in the GAW18 dataset.

Figure S4. Power to detect rare-variant association in large pedigrees using the weighted linear kernel. 5% of rare variants in the causal region are causal variants. Yellow bars: Power without screening. Others: Power with screening. Mean trait difference between European and African is 0.25. 10–50% of regions entered the second stage.

Figure S5. QQ plots from GAW18 analyses. Top Left: Test on association between observed genotype and SBP. Top Right: Test on association between within-family component and SBP. Bottom Left: Test on association between observed genotype and DBP. Bottom Right: Test on association between within-family component and DBP

Acknowledgments

This work was supported by NIH grants GM117946 and HG007508. The Genetic Analysis Workshop 18 (GAW18) is supported by NIH grant R01 GM031575. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889.

References

  1. Abecasis GR, et al. A general test of association for quantitative traits in nuclear families. Am J Hum Genet. 2000;66(1):279–292. doi: 10.1086/302698.
  2. Abecasis GR, et al. Pedigree tests of transmission disequilibrium. European Journal of Human Genetics. 2000;8(7). doi: 10.1038/sj.ejhg.5200494.
  3. Almasy L, et al. Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proceedings. BioMed Central; 2014. doi: 10.1186/1753-6561-8-S1-S2.
  4. Chen H, et al. Testing genetic association with rare and common variants in family data. Genetic Epidemiology. 2014;38(S1):S37–S43. doi: 10.1002/gepi.21823.
  5. Chen H, et al. Sequence kernel association test for quantitative traits in family samples. Genetic Epidemiology. 2013;37(2):196–204. doi: 10.1002/gepi.21703.
  6. Cheung CY, et al. A statistical framework to guide sequencing choices in pedigrees. The American Journal of Human Genetics. 2014;94(2):257–267. doi: 10.1016/j.ajhg.2014.01.005.
  7. Cruceanu C, et al. Family-based exome-sequencing approach identifies rare susceptibility variants for lithium-responsive bipolar disorder 1. Genome. 2013;56(10):634–640. doi: 10.1139/gen-2013-0081.
  8. Davies RB. Algorithm AS 155: The Distribution of a Linear Combination of χ2 Random Variables. Journal of the Royal Statistical Society Series C (Applied Statistics). 1980;29(3):323–333.
  9. Gravel S, et al. Demographic history and rare allele sharing among human populations. Proceedings of the National Academy of Sciences. 2011;108(29):11983–11988. doi: 10.1073/pnas.1019276108.
  10. Jiang D, McPeek MS. Robust rare variant association testing for quantitative traits in samples with related individuals. Genetic Epidemiology. 2014;38(1):10–20. doi: 10.1002/gepi.21775.
  11. Jiang Y, et al. Flexible and Robust Methods for Rare-Variant Testing of Quantitative Traits in Trios and Nuclear Families. Genetic Epidemiology. 2014;38(6):542–551. doi: 10.1002/gepi.21839.
  12. Jiang Y, et al. Assessing the impact of population stratification on association studies of rare variation. Human Heredity. 2013;76(1):28–35. doi: 10.1159/000353270.
  13. Kwee LC, et al. A powerful and flexible multilocus association test for quantitative traits. The American Journal of Human Genetics. 2008;82(2):386–397. doi: 10.1016/j.ajhg.2007.10.010.
  14. Laird NM, Lange C. Family-based designs in the age of large-scale gene-association studies. Nature Reviews Genetics. 2006;7(5):385–394. doi: 10.1038/nrg1839.
  15. Lee S, et al. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13(4):762–775. doi: 10.1093/biostatistics/kxs014.
  16. Lin X. Variance component testing in generalized linear models with random effects. Biometrika. 1997;84:309–326.
  17. Liu D, et al. Semiparametric Regression of Multidimensional Genetic Pathway Data: Least-Squares Kernel Machines and Linear Mixed Models. Biometrics. 2007;63(4):1079–1088. doi: 10.1111/j.1541-0420.2007.00799.x.
  18. Louis-Dit-Picard H, et al. KLHL3 mutations cause familial hyperkalemic hypertension by impairing ion transport in the distal nephron. Nature Genetics. 2012;44(4):456–460. doi: 10.1038/ng.2218.
  19. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics. 2009;5(2):e1000384. doi: 10.1371/journal.pgen.1000384.
  20. Mathieson I, McVean G. Differential confounding of rare and common variants in spatially structured populations. Nature Genetics. 2012;44(3):243–246. doi: 10.1038/ng.1074.
  21. Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010;34(2):188–193. doi: 10.1002/gepi.20450.
  22. Musunuru K, et al. Exome sequencing, ANGPTL3 mutations, and familial combined hypolipidemia. New England Journal of Medicine. 2010;363(23):2220–2227. doi: 10.1056/NEJMoa1002926.
  23. Ott J, et al. Genetic linkage analysis in the age of whole-genome sequencing. Nature Reviews Genetics. 2015. doi: 10.1038/nrg3908.
  24. Purcell S, et al. Parental phenotypes in family-based association analysis. Am J Hum Genet. 2005;76(2):249–259. doi: 10.1086/427886.
  25. Schaffner SF, et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005;15(11):1576–1583. doi: 10.1101/gr.3709305.
  26. Schaid DJ, et al. Multiple genetic variant association testing by collapsing and kernel methods with pedigree or population structured data. Genetic Epidemiology. 2013;37(5):409–418. doi: 10.1002/gepi.21727.
  27. Schifano ED, et al. SNP set association analysis for familial data. Genet Epidemiol. 2012;36(8):797–810. doi: 10.1002/gepi.21676.
  28. Serre D, et al. Correction of population stratification in large multi-ethnic association studies. PLoS One. 2008;3(1):e1382. doi: 10.1371/journal.pone.0001382.
  29. Simpson CL, et al. Old lessons learned anew: family-based methods for detecting genes responsible for quantitative and qualitative traits in the Genetic Analysis Workshop 17 mini-exome sequence data. BMC Proceedings. BioMed Central Ltd; 2011.
  30. Wang JL, et al. TGM6 identified as a novel causative gene of spinocerebellar ataxias using exome sequencing. Brain. 2010;133(12):3510–3518. doi: 10.1093/brain/awq323.
  31. Wijsman EM. The role of large pedigrees in an era of high-throughput sequencing. Human Genetics. 2012;131(10):1555–1563. doi: 10.1007/s00439-012-1190-2.
  32. Wijsman EM, Amos CI. Genetic analysis of simulated oligogenic traits in nuclear and extended pedigrees: summary of GAW10 contributions. Genetic Epidemiology. 1997;14(6):719–735. doi: 10.1002/(SICI)1098-2272(1997)14:6<719::AID-GEPI28>3.0.CO;2-S.
  33. Wilson AF, Ziegler A. Lessons learned from Genetic Analysis Workshop 17: transitioning from genome-wide association studies to whole-genome statistical genetic analysis. Genetic Epidemiology. 2011;35(S1):S107–S114. doi: 10.1002/gepi.20659.
  34. Wu MC, et al. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029.
  35. Wu MC, et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029.
  36. Zawistowski M, et al. Extending rare-variant testing strategies: analysis of noncoding sequence and imputed genotypes. The American Journal of Human Genetics. 2010;87(5):604–617. doi: 10.1016/j.ajhg.2010.10.012.
  37. Zöllner S. Sampling strategies for rare variant tests in case control studies. European Journal of Human Genetics. 2012;20(10):1085–1091. doi: 10.1038/ejhg.2012.58.
