Abstract
There is great interest in detecting associations between human traits and rare genetic variation. To address the low power implicit in single-locus tests of rare genetic variants, many rare-variant association approaches attempt to accumulate information across a gene, often by taking linear combinations of single-locus contributions to a statistic. Using the right linear combination is key: an optimal test will up-weight true causal variants, down-weight neutral variants, and correctly assign the direction of effect for causal variants. Here, we propose a procedure that exploits data from population controls to estimate the linear combination to be used in a case-parent trio rare-variant association test. Specifically, we estimate the linear combination by comparing population control allele frequencies with allele frequencies in the parents of affected offspring. These estimates are then used to construct a rare-variant transmission disequilibrium test (rvTDT) in the case-parent data. Because the rvTDT is conditional on the parents’ data, using parental data in estimating the linear combination does not affect the validity or asymptotic distribution of the rvTDT. By using simulation, we show that our new population-control-based rvTDT can dramatically improve power over rvTDTs that do not use population control information across a wide variety of genetic architectures. It also remains valid under population stratification. We apply the approach to a cohort of epileptic encephalopathy (EE) trios and find that dominant (or additive) inherited rare variants are unlikely to play a substantial role within EE genes previously identified through de novo mutation studies.
Introduction
Genome-wide association studies (GWASs) have identified thousands of disease-associated variants. Though these variants have often shed light on the biological processes involved in disease, they have explained only a small fraction of the genetic variance of most disease phenotypes.1 Researchers have proposed that rare variants of large effect may account for this “missing heritability.”2,3 Because rare variants are generally not present on GWAS platforms and next-generation sequencing technologies have become economical, many research groups are transitioning to whole-genome or whole-exome sequencing as their primary approach to measuring genetic variation.
Compared to common variants, rare variants are more likely to be mutations of recent origin and therefore more likely to be population specific. If differences in disease risk also occur between populations, a strong correlation can be induced between rare variation and disease risk, even when there is no causal relationship between variants and disease. This can lead to spurious associations when analyzing case-control studies of unrelated individuals. Although various methods have been proposed to adjust for such spurious correlations in the context of common variation,4–6 Mathieson and McVean7 have shown that in certain situations these methods may fail to correct for spurious association in the context of rare variation. An alternative strategy for dealing with confounding due to population structure is to employ family-based tests of association such as the transmission disequilibrium test (TDT). The TDT compares alleles that are transmitted from parents to an affected offspring with the alleles that are untransmitted. Deviation from Mendelian transmission rates is evidence that the site being tested either is itself a disease locus or is in linkage disequilibrium with a disease locus. An important feature of this analysis is that the comparison is within a family (transmitted versus untransmitted alleles), making the TDT robust to confounding due to population stratification.
Like other single-locus methods, single-locus TDT analyses will have low power when the disease-associated allele is rare. In order to address this problem, current approaches to analyzing rare genetic variation accumulate information across a gene or other genetic unit, often by taking a weighted combination of single-locus contributions to a score test or other statistic. Using the “right” weighted combination is critical; an optimal test will up-weight true causal variants and down-weight neutral variants. Because the true causal loci are unknown, much of the rare-variant association literature involves identifying flexible approaches to weighting individual loci.8–11 Similar methods have been applied in the context of rare-variant analyses in family-based designs.12,13
In this manuscript, we propose a rare-variant TDT (rvTDT) that employs a novel approach to estimating powerful linear combinations of variants, within a genetic unit, by utilizing data from population controls. Specifically, we weight loci based on comparing variant frequencies observed in the parents of affected offspring with those observed in population controls. Because the rvTDT is conditional on parental genotype, using parental data in deriving these weights does not affect the validity or asymptotic distribution of the rvTDT.
In the next section we present a general framework for deriving rvTDTs. We show how this framework leads to standard burden as well as “directionless” rare-variant tests analogous to the sequence kernel association test (SKAT). Locus-specific coefficients are a feature of each of these tests and we show how powerful linear combinations can be derived by comparing parental data with population controls. In the results section, using simulation, we show that our approach can dramatically improve the power both of burden and “directionless” rvTDTs across a wide range of genetic architectures. Finally, we apply these methods to 149 epileptic encephalopathy (EE [MIM 308350]) trios by using ∼6,500 samples from the Exome Sequencing Project (National Heart, Lung and Blood Institute [NHLBI] Exome Sequencing Project, Seattle, WA) as population controls.
Material and Methods
A General Framework for Rare-Variant Transmission Disequilibrium Tests
We begin by characterizing the standard conditional-on-parental-genotype likelihood14,15 for a single (jth) locus. This model specifies the distribution of offspring genotypes conditional on parent genotypes and the affection status of the child in terms of a relative risk disease model that is a function of the child’s genotype and Mendelian transmission probabilities. Throughout, we will denote random variables with uppercase letters and realizations of those random variables by lowercase. We assume that we have a sample of n parent-offspring trios in which the offspring is affected by the disorder being studied. Let cij, mij, pij denote the ith trio’s child, maternal, and paternal genotypes, respectively, for the jth locus. We will assume that these will be encoded in terms of the number of minor alleles observed, so that each genotype can be 0, 1, or 2. Let A = 1 indicate that the offspring is affected and let x be a design vector that encodes the genetic effect of the offspring’s genotype on disease risk. In all analyses reported here, we will assume an additive model, so that x(c) is simply the number of mutant alleles observed, i.e., x(c) = c. We note that when variants are rare, the additive and dominant models will approximate each other. Let Pr(A = 1|C = c)/Pr(A = 1|C = 0) be the risk of an offspring being affected given they have c copies of the mutant allele relative to their risk when they have no copies. If we model this relative risk through the parameter β via Pr(A = 1|C = c)/Pr(A = 1|C = 0) = exp[β⋅x(c)], then the ith trio’s contribution to the conditional-on-parent likelihood for the jth locus can be written as
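$$L_{ij}(\beta) = \frac{\exp[\beta \cdot x(c_{ij})]\,\Pr(C = c_{ij} \mid M = m_{ij}, P = p_{ij})}{\sum_{c^{*}=0}^{2} \exp[\beta \cdot x(c^{*})]\,\Pr(C = c^{*} \mid M = m_{ij}, P = p_{ij})},$$
where the probability Pr(C = c|M = m, P = p) is the Mendelian probability of an offspring having genotype c given parental genotypes m and p and, thus, is made up of known constants.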
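Differentiating the log-likelihood log [Lij(β)] with respect to β and evaluating at β = 0 gives the ith trio’s contribution to the score for the jth locus
$$u_{ij} = x(c_{ij}) - \sum_{c^{*}=0}^{2} x(c^{*})\,\Pr(C = c^{*} \mid M = m_{ij}, P = p_{ij}),$$
which, under the additive model x(c) = c, reduces to the child’s genotype minus its Mendelian expectation, cij − (mij + pij)/2.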
A score test for the jth locus can then be formed as
$$t_{j} = \frac{\left(\sum_{i=1}^{n} u_{ij}\right)^{2}}{\sum_{i=1}^{n} u_{ij}^{2}},$$
which is a realization of a random variable that asymptotically, under the null hypothesis (β = 0), will be distributed as chi-square on 1 degree of freedom.
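As a rough illustration of the single-locus score test, the sketch below computes the additive-model score contributions (child genotype minus the mean of the parental genotypes) and divides the squared score by its empirical variance. The function name `tdt_score_test` is hypothetical, and the empirical-variance denominator is an assumption made for this sketch rather than a detail confirmed by the text.

```python
def tdt_score_test(child, mother, father):
    """Single-locus score statistic under the additive model x(c) = c.

    Each argument is a list of minor-allele counts (0, 1, or 2), one per trio.
    The empirical-variance denominator is an assumption of this sketch.
    """
    # Score contribution: observed child genotype minus its Mendelian expectation.
    u = [c - (m + p) / 2 for c, m, p in zip(child, mother, father)]
    total = sum(u)
    # Trios with two homozygous parents contribute zero to both sums.
    var = sum(x * x for x in u)
    if var == 0:
        return 0.0  # no informative trios at this locus
    return total * total / var


# Four trios, each with exactly one heterozygous parent:
# three transmissions of the minor allele and one non-transmission.
stat = tdt_score_test(child=[1, 1, 1, 0], mother=[1, 1, 0, 1], father=[0, 0, 1, 0])
```

When every informative trio has exactly one heterozygous parent, this quantity coincides with the classical TDT form (b − c)²/(b + c), where b and c count transmissions and non-transmissions of the minor allele.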
When variants are rare, single-marker tests will have low power. A standard approach is to accumulate information across a gene or other genetic unit.8,9,16,17 One way to do this is to create a gene-level test by taking linear combinations of score contributions across a gene. Specifically, assume there are k variants within a gene and let
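$$u_{i\cdot} = \sum_{j=1}^{k} \alpha_{j} u_{ij},$$
where the αjs are coefficients that define the linear combination and, for now, are assumed to be fixed. A gene-level test statistic is given by
$$t_{LC} = \frac{\left(\sum_{i=1}^{n} u_{i\cdot}\right)^{2}}{\sum_{i=1}^{n} u_{i\cdot}^{2}},$$
where LC is “linear combination.” Under the global null hypothesis that none of the k loci in the gene are associated with the affection status of the offspring, tLC is, as n → ∞, a realization of a chi-square random variable on 1 degree of freedom.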
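The linear-combination statistic can be sketched in a few lines. The helper names below are hypothetical, the statistic is assumed to take the form of a squared sum of per-trio combined scores divided by their empirical variance, and the p value uses the identity P(χ²₁ > t) = erfc(√(t/2)).

```python
import math


def linear_combination_test(u, alpha):
    """Gene-level statistic from an n x k list of score contributions u[i][j].

    alpha holds the k fixed coefficients. The empirical-variance denominator
    is an assumption of this sketch.
    """
    # Collapse each trio's k score contributions into one weighted score.
    per_trio = [sum(a * x for a, x in zip(alpha, row)) for row in u]
    total = sum(per_trio)
    var = sum(v * v for v in per_trio)
    return total * total / var if var > 0 else 0.0


def chi2_1df_pvalue(t):
    """Upper-tail probability of a chi-square variable on 1 degree of freedom."""
    return math.erfc(math.sqrt(t / 2))


scores = [[0.5, 0.0], [0.5, -0.5], [0.0, 0.5]]  # 3 trios, 2 loci (toy values)
t_lc = linear_combination_test(scores, alpha=[1, 1])
p_lc = chi2_1df_pvalue(t_lc)
```

Changing the sign of an entry of `alpha` flips the direction in which that locus contributes, which is why assigning the direction of effect correctly matters for this test.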
An alternative approach to accumulating information across a gene into a gene-level test begins by first summing score contributions across individuals, i.e., by forming u.j = Σi uij, and then taking a linear combination of the resulting squared statistics, i.e.,
$$t_{K} = \sum_{j=1}^{k} \alpha_{j}^{2} u_{\cdot j}^{2}.$$
The K in tK denotes “kernel” and this statistic is similar in structure to SKAT and other kernel-based rare-variant association methods.11,13 As such, it can be shown that, under the global null hypothesis, tK is asymptotically a realization of
$$T_{K} \sim \sum_{j=1}^{k} \lambda_{j} \chi^{2}_{1,j},$$
where the χ²1,j are independent chi-square random variables on 1 degree of freedom and λj is the jth eigenvalue of the k × k covariance matrix of the weighted sums αjU.j. We can estimate this covariance matrix empirically by
$$\hat{\Sigma} = D U^{T} U D,$$
where U is the n × k matrix with i, j component uij and D is the k × k diagonal matrix with diagonal elements α1, …, αk. We use Davies’ method to approximate the null distribution of TK.18,19 Note that both statistics (TLC and TK) are functions of a vector of coefficients α = (α1, …, αk)T. To make this explicit, we will write TLC(α) and TK(α).
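The kernel-type statistic and its null distribution can be illustrated as follows. This sketch assumes the statistic is the sum of squared weighted column sums (αj u.j)², which is the form consistent with the covariance estimate D UᵀU D. It also replaces Davies' method with a plain Monte Carlo approximation: if z is a vector of n independent standard normals, then D Uᵀz has covariance D UᵀU D, so its squared length follows the same weighted chi-square mixture. The function name and defaults are hypothetical.

```python
import random


def kernel_test(U, alpha, n_draws=20000, seed=0):
    """Kernel-type statistic with a Monte Carlo p value.

    U is an n x k list of score contributions u[i][j]; alpha has k coefficients.
    Monte Carlo stands in for Davies' method here (an assumption of this sketch).
    """
    k = len(alpha)
    # Weighted column sums alpha_j * u_.j, then the sum of their squares.
    col = [alpha[j] * sum(row[j] for row in U) for j in range(k)]
    t_k = sum(v * v for v in col)

    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in U]
        # w = D U^T z has covariance D U^T U D, so |w|^2 follows the null mixture.
        w = [alpha[j] * sum(zi * row[j] for zi, row in zip(z, U)) for j in range(k)]
        if sum(v * v for v in w) >= t_k:
            exceed += 1
    return t_k, exceed / n_draws


t_k, p_k = kernel_test([[0.5], [0.5], [0.5], [-0.5]], alpha=[1.0])
```

Because each coefficient enters only through its square, this statistic ignores the sign of αj. The Monte Carlo loop costs on the order of n_draws × n × k operations and cannot resolve very small p values, which is one reason an analytic method such as Davies' is preferable in practice.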
Incorporating Information from Population Controls
The power of TLC(α), TK(α), and similar gene-level rare-variant tests is critically dependent on the choice of α. An optimal test will up-weight true causal variants, down-weight neutral variants, and correctly assign the direction of the effect for causal variants. Previous approaches to choosing α have attempted to leverage hypothesized relationships between penetrance and variant frequency8 or have estimated the optimal linear combination from the data.20 Here we also take a data-driven approach and estimate α by direct comparison of variant frequencies in parents to those in large population control databases. Intuitively, this exploits the fact that if a variant is associated with a child being affected, it will tend to be enriched not only in the offspring but also in the parents (because all variants found among offspring must also be present in their parents). Thus this comparison will be informative for determining which variants are likely to be important and which are not. However, unlike other methods that attempt to estimate an optimal linear combination from the data, the parent/population-control comparison used to estimate α is orthogonal to the final test and therefore does not affect the validity or asymptotic distribution of TLC(α) and TK(α) regardless of whether α is “correctly” estimated or not. A detailed proof of this claim can be found in Appendix A.
There are many ways one could estimate α from the control/parent comparison. When individual-level data are available, one could estimate the coefficients jointly by, for example, fitting a regularized multivariable logistic regression model, with control/parent as the outcome, to all the variants being considered at once. However, there are very large publicly available data sets for which only aggregated summary statistics at each variant site are given. For example, the Exome Sequencing Project (ESP) data contain well-characterized, deep coverage, whole-exome sequencing data on more than 6,500 individuals. However, until recently, ESP reported only overall genotype counts at each variant site. This necessitates a marginal approach to estimating the coefficients, where the population controls and the parents’ allele frequencies are compared site-by-site. In simulations (not shown) we found that the jointly estimated coefficients performed slightly better than the marginally estimated coefficients when the control sample size for both analyses was the same. However, the marginal analysis currently allows the incorporation of much larger control sets and this was the dominant factor in these analyses, i.e., the marginal estimates had much higher power when the control sample size was on the order of ESP than that observed from a joint analysis applied to the smaller control samples containing individual-level data that are currently available. For this reason we focus on the marginal estimation of α such that, for the jth locus, αj is simply a signed value of the Cochran-Armitage trend test statistic,21 obtained by comparing parents to population controls. The sign is given by the direction of any allele frequency difference between parents and controls: when mutations at the jth locus are more frequent in parents, αj remains positive; when they are less frequent in parents, αj is made to be negative. We denote this estimate of α by αPC.
We note that individual-level ESP data are now available through dbGaP. We intend to conduct a detailed investigation into methods for jointly estimating α in a future manuscript.
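The marginal estimate of αj needs only genotype counts at each site, so it can be computed directly from summary-level control data. The sketch below computes the Cochran-Armitage trend chi-square with genotype scores (0, 1, 2) and attaches the sign of the parent-versus-control difference. Attaching the sign to the chi-square value (rather than, say, to its square root) is an assumption, as are the function name and the example counts.

```python
import math


def signed_trend_statistic(parent_counts, control_counts):
    """Signed Cochran-Armitage trend chi-square for one variant site.

    Each argument is a tuple (n0, n1, n2) of genotype counts indexed by the
    number of minor alleles. Positive values mean the minor allele is more
    frequent in parents than in population controls.
    """
    scores = (0, 1, 2)
    R = sum(parent_counts)    # number of parents
    S = sum(control_counts)   # number of population controls
    N = R + S
    totals = [r + s for r, s in zip(parent_counts, control_counts)]

    # Trend numerator: sum over genotypes of t_i * (r_i * S - s_i * R).
    T = sum(t * (r * S - s * R)
            for t, r, s in zip(scores, parent_counts, control_counts))
    # Variance of T under the null hypothesis of no frequency difference.
    sum_t2n = sum(t * t * n for t, n in zip(scores, totals))
    sum_tn = sum(t * n for t, n in zip(scores, totals))
    var = (R * S / N) * (N * sum_t2n - sum_tn ** 2)
    if var == 0:
        return 0.0  # monomorphic site carries no information
    return math.copysign(T * T / var, T)


# Hypothetical counts: 10 heterozygotes among 100 parents versus 10 among 1,000 controls.
alpha_j = signed_trend_statistic((90, 10, 0), (990, 10, 0))
```

Applying this function site by site to the parental and control genotype counts gives one coefficient per variant, which together form a vector playing the role of αPC.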
Simulation
We compared the performance of our proposed approach to several existing tests for rare variant association in parent-offspring trios by using simulated data. In order to obtain simulated data having linkage disequilibrium patterns that are similar to true whole-exome sequences, we used a coalescent-based approach. Specifically, we simulated 100,000 haplotypes of a 20 kb sequence using COSI22 representing the complete sequence of a gene (i.e., both exons and introns). Because we are attempting to mimic data generated from a whole-exome sequencing experiment, we extract from these 20 kb sequences 5 randomly located subregions representing “captured coding sequences” for a total “coding” length of 1.5 kb. We formed genotypes of founders by randomly sampling haplotypes from this pool. The affection status (A) of each simulated sample was determined based on its simulated genotype by randomly sampling from a Bernoulli random variable with disease probability (i.e., Pr(A = 1)) given by exp(β0 + XTβ)/(1 + exp(β0 + XTβ)), where X represents a design vector involving a set of disease-causal variants and β is a vector encoding the effect of these variants on disease. We assumed a model in which variants at each locus have the same population attributable fraction (PAF). This model allows us to control the fraction of cases whose disease is explained by mutations at a locus while also parameterizing β in terms of variant frequency. Specifically, this model corresponds to choosing the elements of β to be |βk| = log(1 + η/(2 MAFk)) (see Appendix B), where η is the per-locus PAF and MAFk denotes the minor allele frequency of the kth variant. The sign of βk is chosen so that risk alleles are given a positive sign and protective alleles are given a negative sign. The intercept β0 is related to the prevalence of disease and is taken to be log(0.05/(1 − 0.05)) = −2.94. Controls are selected from samples with A = 0.
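To make the disease model concrete, the sketch below evaluates the effect-size formula |βk| = log(1 + η/(2 MAFk)) and the logistic disease probability with intercept log(0.05/0.95). The function names and the numerical values of η and MAF in the example are hypothetical and chosen only for illustration.

```python
import math


def effect_size(maf, paf, protective=False):
    """Log relative risk implied by a fixed per-locus PAF (eta) at a given MAF."""
    beta = math.log(1 + paf / (2 * maf))
    return -beta if protective else beta


def disease_probability(genotype, betas, beta0=math.log(0.05 / (1 - 0.05))):
    """Logistic disease probability exp(b0 + x'b) / (1 + exp(b0 + x'b))."""
    linear = beta0 + sum(b * g for b, g in zip(betas, genotype))
    return 1 / (1 + math.exp(-linear))


# Rarer variants receive larger effects under a constant per-locus PAF.
beta_rare = effect_size(maf=0.001, paf=0.001)
beta_less_rare = effect_size(maf=0.01, paf=0.001)

# A non-carrier has the baseline probability implied by the intercept.
baseline = disease_probability([0, 0], [beta_rare, beta_less_rare])
```

Under this parameterization the effect size grows as the minor allele frequency shrinks, so each causal locus explains the same fraction of cases regardless of its frequency.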
For trios, we generate parents’ genotypes as above, and then, assuming no crossover within the gene, generate offspring’s genotype by randomly selecting one haplotype from each parent. Once the offspring’s genotype is generated, we determine the affection status of the offspring as above and keep only those trios in which the offspring is affected. We continue this process until we obtain a sample comprised of 500 trios and 5,000 population controls. We considered analyses involving both the rare (variants with MAF less than 0.01) and common (up to a MAF of 0.05) variants. Each simulation conducted assuming the null hypothesis was based on 10,000 replicates and each simulation conducted assuming an alternative hypothesis was based on 1,000 replicates.
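The transmission step of the trio simulation can be sketched as follows, assuming haplotypes are stored as tuples of 0/1 minor-allele indicators. The function name and the tiny haplotype pool are hypothetical; the COSI haplotype generation and the retention of only affected offspring are not shown.

```python
import random


def simulate_trio(pool, rng):
    """Draw two haplotypes per parent, then transmit one haplotype from each.

    No recombination occurs within the gene, so whole haplotypes are passed on.
    Returns (child, mother, father) genotype vectors of minor-allele counts.
    """
    mother = (rng.choice(pool), rng.choice(pool))
    father = (rng.choice(pool), rng.choice(pool))
    child = (rng.choice(mother), rng.choice(father))

    def genotype(pair):
        # Sum the two haplotypes locus by locus to get 0, 1, or 2.
        return [a + b for a, b in zip(pair[0], pair[1])]

    return genotype(child), genotype(mother), genotype(father)


rng = random.Random(2014)
pool = [(0, 0, 1), (1, 0, 0), (0, 0, 0), (0, 1, 0)]  # toy pool: 4 haplotypes, 3 loci
child, mother, father = simulate_trio(pool, rng)
```

In a full simulation, each generated child would then be assigned an affection status from the disease model, and the trio would be kept only if the child is affected.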
To confirm that test size is well maintained in the presence of population stratification as well as population admixture, we simulated, using COSI,22 two populations: European and African. We induced confounding due to population stratification by simulating different disease prevalences in the two populations: 0.05 for Europeans and 0.01 for Africans. Following the simulation scheme detailed above, we simulated 500 case-parent trios and 5,000 population controls, under two scenarios. In the first, we sampled both trios and controls from a parent population in which Africans and Europeans were represented in equal proportions, i.e., 50% African and 50% European. In the second, in order to generate more pronounced bias in estimated coefficients due to population structure, we simulated a scenario where the population control source population had very different population structure from the parent population from which the trios were sampled. In particular, we assume the control source population was comprised of a 20:80 mix of Europeans to Africans, and the trio-parent population was comprised of an 80:20 mix. We also simulated admixture. Similar to the second population stratification scenario, we generate large differences in admixture proportions between population controls and trios by sampling controls from a source population in which individuals have an average admixture proportion that is 80% African and 20% European, while we sample trios from a parent population in which these proportions are reversed, i.e., 20% African and 80% European. Prevalence was taken to be 0.01 in the control source population and 0.05 in the trio source population. In order to further illustrate that our approach maintains the correct size even in this extreme admixture scenario, we also simulate a larger (1,000 trio) sample size scenario in addition to the 500 trio-based simulations described above.
To investigate the power of the αPC-based tests when the control population differs from the parental population due to population stratification, we simulated three scenarios under the alternative. The simulation structure used in these analyses was the same as in the population stratification simulations under the null described in the paragraph above except that: (1) we assumed that 30% of the variants in the gene affected disease risk and that the effect of each causal variant was related to its minor allele frequency through the framework described above; and (2) we assumed different mixtures of Africans and Europeans in the controls and trios. Specifically, scenario 1 generated population controls by sampling from a source population with a 20:80 mix of Europeans to Africans, and the trio parent population was comprised of an 80:20 mix. For the second scenario we generated the control source population so that it was comprised of a 40:60 mix of Europeans to Africans, and the trio parent population was comprised of a 60:40 mix. The third scenario involved a control source population that was a 60:40 mix of Europeans to Africans, and the trio parent population was comprised of an 80:20 mix.
For each simulated data set we analyzed the data using both TLC(α) and TK(α), each utilizing three distinct estimates of α. First, we give all variants equal weight when combining score statistics so that α = 1 = (1,…,1)T. Second, we weight variant contributions by a function that gives increased weight to rarer variants. Specifically, for the kth variant, we take αk to be the beta(1,25) probability density function evaluated at MAFk. This is the same weighting scheme used in SKAT.11,13 We denote this scheme by αMAF. Third, we use the population control based weights, αPC, detailed above.
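For the frequency-based scheme, the beta(1,25) density has the closed form 25(1 − x)²⁴, because the normalizing constant B(1, 25) equals 1/25. A minimal sketch (the function name is hypothetical):

```python
def beta_1_25_weight(maf):
    """Beta(1, 25) density evaluated at a minor allele frequency in [0, 1]."""
    return 25.0 * (1.0 - maf) ** 24


# Weights fall steadily as the minor allele frequency rises.
weights = [beta_1_25_weight(maf) for maf in (0.001, 0.01, 0.05)]
```

Within the rare range (MAF ≤ 0.01) the weight changes only modestly, whereas a variant at MAF 0.05 receives well under half the weight of a very rare one.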
Results
Simulation Comparison of Different Methods
Our simulations compared the power and type I error of TLC(1), TLC(αMAF), TLC(αPC), TK(1), TK(αMAF), and TK(αPC). Table 1 summarizes this comparison when all disease-associated alleles are rare (MAF ≤ 0.01) and are risk alleles. When the proportion of causal alleles is zero, i.e., the gene is not associated with disease, all tests maintain the nominal rate. When the proportion of causal alleles is greater than zero, we see substantial power differences between tests. First, we see that the linear combination-based tests have greater power than the kernel-based tests. This is perhaps not surprising given that all causal variants increase disease risk (i.e., there are no protective variants in this simulation). We do not see substantial power differences between TLC(1) and TLC(αMAF) nor between TK(1) and TK(αMAF), presumably because all causal variants are rare. However, we see that the αPC-based tests offer substantially more power when compared to the other kernel-based or linear combination tests. TLC(αPC) obtains the best power overall, yielding, for example when 10% of the variants are causal, a dramatic, greater than 3-fold increase in power over the next best, non-αPC, test (TLC(αMAF) = 0.098; TLC(αPC) = 0.317). Table 2 compares the tests when the causal variants are rare (MAF ≤ 0.01) but include both risk and protective alleles. Consistent with previous research,13 we found that the kernel-based tests TK(1) and TK(αMAF) had higher power than the linear combination-based tests TLC(1) and TLC(αMAF). Again, we see little difference in power between the α = 1-based tests and the αMAF-based tests. However, the αPC-based tests are far more powerful. For example, when 20% of variants are causal, TLC(αPC) yields a greater-than-4-fold increase in power over the most powerful α = 1-based or αMAF-based test (TLC(αMAF) = 0.139; TLC(αPC) = 0.625). 
It is also notable that TLC(αPC) outperforms TK(αPC), presumably because the linear combination defined by αPC correctly captures protective and risk effects.
Table 1.
Comparison of Power and Type I Error when All Disease-Susceptibility Variants Confer Risk and Only Rare Variants Are Included in Analyses
| Proportion of Variants that Are Causal | TK(1) | TK(αMAF) | TK(αPC) | TLC(1) | TLC(αMAF) | TLC(αPC) |
|---|---|---|---|---|---|---|
| 0 | 0.043 | 0.041 | 0.038 | 0.046 | 0.047 | 0.046 |
| 0 (α = 0.01) | 0.006 | 0.006 | 0.005 | 0.010 | 0.010 | 0.008 |
| 0.05 | 0.066 | 0.059 | 0.154 | 0.055 | 0.054 | 0.162 |
| 0.1 | 0.060 | 0.062 | 0.221 | 0.097 | 0.098 | 0.317 |
| 0.15 | 0.114 | 0.110 | 0.297 | 0.195 | 0.198 | 0.48 |
| 0.2 | 0.196 | 0.189 | 0.394 | 0.38 | 0.376 | 0.686 |
| 0.25 | 0.253 | 0.283 | 0.385 | 0.64 | 0.659 | 0.725 |
| 0.3 | 0.302 | 0.327 | 0.478 | 0.72 | 0.738 | 0.844 |
Analyses include only rare variants (MAF ≤ 0.01). Rows correspond to different proportions of variants in the gene that are disease causal (0 corresponds to the null hypothesis). Columns correspond to various tests considered: TK, kernel based; TLC, linear combination based; 1, unweighted; αMAF, inversely weighted by minor allele frequency; αPC, population control based. All tests conducted at the 0.05 α level except where noted (row 2).
Table 2.
Power Comparison when Causal Variants Are Comprised of Both Risk and Protective Mutations
| Proportion of Variants that Are Causal | TK(1) | TK(αMAF) | TK(αPC) | TLC(1) | TLC(αMAF) | TLC(αPC) |
|---|---|---|---|---|---|---|
| 0.05 | 0.047 | 0.042 | 0.082 | 0.04 | 0.043 | 0.096 |
| 0.1 | 0.069 | 0.077 | 0.294 | 0.054 | 0.053 | 0.351 |
| 0.15 | 0.149 | 0.145 | 0.351 | 0.222 | 0.213 | 0.441 |
| 0.2 | 0.128 | 0.129 | 0.391 | 0.127 | 0.139 | 0.625 |
| 0.25 | 0.240 | 0.254 | 0.401 | 0.129 | 0.125 | 0.704 |
| 0.3 | 0.272 | 0.325 | 0.584 | 0.213 | 0.247 | 0.888 |
Analyses include only rare variants (MAF ≤ 0.01). Rows correspond to different proportions of variants in the gene that are disease causal. Columns correspond to various tests considered: TK, kernel based; TLC, linear combination based; 1, unweighted; αMAF, inversely weighted by minor allele frequency; αPC, population control based.
The above analyses restricted the variants being analyzed to be rare (MAF ≤ 0.01). Here we investigate the effect of analyzing both rare and common variants, but where the common variants are in fact neutral. For simplicity, we simulated under the scenario in which all disease-associated variants are risk alleles. As shown in Table 3, when introducing common neutral variants, TLC(αMAF) and TK(αMAF) had higher power than the unweighted tests TLC(1) and TK(1). Because all common variants are neutral variants, αMAF down-weights these common variants, minimizing the “noise” that results from analyzing a large number of neutral variants. Even so, the αPC-based tests still outperformed all other methods by a wide margin. For example, when 10% of variants are causal, TLC(αPC) showed a greater-than-5-fold increase in power over the unweighted and αMAF-based tests (TLC(αMAF) = 0.085; TLC(αPC) = 0.438).
Table 3.
Comparison of Power and Type I Error when All Disease-Susceptibility Variants Confer Risk and Both Rare and Common Variants Are Included in Analyses
| Proportion of Variants that Are Causal | TK(1) | TK(αMAF) | TK(αPC) | TLC(1) | TLC(αMAF) | TLC(αPC) |
|---|---|---|---|---|---|---|
| 0 | 0.050 | 0.048 | 0.050 | 0.052 | 0.049 | 0.050 |
| 0 (α = 0.01) | 0.009 | 0.009 | 0.010 | 0.010 | 0.009 | 0.011 |
| 0.05 | 0.059 | 0.064 | 0.260 | 0.054 | 0.051 | 0.200 |
| 0.1 | 0.060 | 0.080 | 0.387 | 0.058 | 0.085 | 0.438 |
| 0.15 | 0.077 | 0.175 | 0.559 | 0.059 | 0.154 | 0.706 |
| 0.2 | 0.090 | 0.216 | 0.584 | 0.098 | 0.288 | 0.781 |
| 0.25 | 0.167 | 0.708 | 0.931 | 0.244 | 0.624 | 0.975 |
| 0.3 | 0.205 | 0.749 | 0.931 | 0.404 | 0.805 | 0.977 |
Analyses include rare and common variants (MAF ≤ 0.05). Rows correspond to different proportions of variants in the gene that are disease causal (0 corresponds to the null hypothesis). Columns correspond to various tests considered: TK, kernel based; TLC, linear combination based; 1, unweighted; αMAF, inversely weighted by minor allele frequency; αPC, population control based. All tests conducted at the 0.05 α level except where noted (row 2).
Table 4 summarizes type I error rates for TLC(αPC) and TK(αPC) for null simulations in the presence of confounding due to population stratification and admixture (QQ-plots of these analyses can be seen in Figure S1 available online). As expected, type I error is well controlled throughout, further illustrating that utilizing the population control based coefficients (αPC) does not affect the robustness of the rvTDT to population stratification.
Table 4.
Type I Error Rate of αPC-Based Tests under Population Stratification and Population Admixture
| Scenario | TK(αPC), α = 0.05 | TLC(αPC), α = 0.05 | TK(αPC), α = 0.01 | TLC(αPC), α = 0.01 |
|---|---|---|---|---|
| PS1 | 0.049 | 0.048 | 0.010 | 0.010 |
| PS2 | 0.047 | 0.051 | 0.011 | 0.011 |
| ADMIX1 | 0.053 | 0.052 | 0.011 | 0.010 |
| ADMIX2 | 0.048 | 0.049 | 0.009 | 0.010 |
Scenarios are as follows: PS1, population stratification with 50% African and 50% European individuals in both controls and trios; PS2, population stratification with 80% Africans and 20% Europeans in controls and 20% Africans and 80% Europeans in trios; ADMIX1, population admixture with 500 trios; ADMIX2, population admixture with 1,000 trios. TK, kernel-based test; TLC, linear-combination-based test.
Table 5 compares the power of the various tests in the presence of population stratification. In the extreme example where 80% of the controls are African (the rest being European) and the trios were 20% African (the rest being European), there is a slight loss of power of the αPC-based tests relative to those tests that do not use population control information. However, even in the presence of substantial stratification (scenarios 2 and 3), we see a power gain in using the αPC-based tests, even though the population control based coefficients (αPC) are probably biased.
Table 5.
Power Comparison under Population Stratification
| Scenario | TK(1) | TK(αMAF) | TK(αPC) | TLC(1) | TLC(αMAF) | TLC(αPC) |
|---|---|---|---|---|---|---|
| 1 | 0.566 | 0.625 | 0.508 | 0.761 | 0.783 | 0.693 |
| 2 | 0.201 | 0.232 | 0.383 | 0.485 | 0.525 | 0.696 |
| 3 | 0.231 | 0.264 | 0.447 | 0.464 | 0.509 | 0.735 |
Scenarios are as follows: 1, population stratification with 80% Africans and 20% Europeans in controls and 20% Africans and 80% Europeans in trios; 2, population stratification with 60% Africans and 40% Europeans in controls and 40% Africans and 60% Europeans in trios; 3, population stratification with 40% Africans and 60% Europeans in controls and 20% Africans and 80% Europeans in trios. TK, kernel-based test; TLC, linear-combination-based test.
Application to Epileptic Encephalopathy
The epileptic encephalopathies (EEs [MIM 308350]) are a group of devastating childhood seizure disorders, characterized by early seizure onset and cognitive and behavioral features associated with ongoing seizure activity. Though large genetic risk factors have been identified,23 EE is known to be heterogeneous and is clearly a complex trait.24 Recent work has pointed to a role for de novo mutations (i.e., mutations that are present in the affected child but absent in both parents) in EE etiology.23 Though such work has identified a number of new EE genes, most of the trios studied are not explained by de novo mutations in these or other known EE genes. Thus, if these unexplained trios have a genetic cause, it is due either to mutations in other “unknown” EE genes or to inherited mutations within the known EE genes. Here, by using the approach detailed above, we test the second of these possibilities: that inherited variants within known EE genes contribute to EE risk.
The study was comprised of 264 whole-exome-sequenced EE trios. This study was carried out in compliance with the institutional review board at Duke University and the relevant ethics boards at the collection sites. Informed consent was obtained from all study participants or their legal guardians. For 41 trios, at least one family member was sequenced via lymphoblastoid cell lines (LCLs). It is well known that sequence differences can arise as part of the LCL immortalization process. This was not an issue for the original study, which focused on de novo variants, each of which was confirmed via Sanger sequencing of whole blood. However, confirming all inherited variants via Sanger sequencing is unrealistic, and so we restrict our analysis to the 223 trios sequenced entirely from whole blood. We further restricted our analysis by excluding trios whose disease was probably explained by de novo mutations in known EE genes, resulting in a final analysis data set of 149 trios.
To avoid transmission bias introduced by jointly calling trio genotypes (i.e., transmitted genotypes would be more likely to be called), each sample was called separately by GATK.25 The entire set of 6,503 samples from the exome-sequencing project (ESP) data set was used as population controls.14 Before our analysis, we applied a series of quality control steps. Variants were removed when more than 20% of families had less than 20-fold coverage in at least one family member. Trios were included in analyses of a given site only if all three family members had 20-fold or greater coverage at that site.
Because we lack the power to detect reasonable effects on a per-gene basis, we formed a single test by combining variants across all the known autosomal EE genes: SCN1A (MIM 182389), SCN2A (MIM 182390), MAPK10 (MIM 602897), STXBP1 (MIM 602926), SPTAN1 (MIM 182810), KCNT1 (MIM 608167), SLC25A22 (MIM 609302), SCN8A (MIM 600702), GABRB3 (MIM 137192), PNKP (MIM 605610), KCNQ2 (MIM 602235), and PLCB1 (MIM 607120). Thus we are addressing the question of whether there is increased transmission of inherited rare variants across the entire collection of EE genes. We found no evidence of such increased transmission (Table 6). To illustrate that our approach controls size in a real data example, we applied the proposed methods to each gene in the exomes of these 149 trios. Due to the small number of trios and the very small number of variants within each gene, the asymptotic approximation may not be accurate. Thus, in this analysis, we employ a permutation approach in which recombinations are assumed not to occur within a gene and transmitted and untransmitted alleles are randomly permuted (this is the same permutation strategy used in Ionita-Laza et al.13). No inflation is observed (Figure S2).
Table 6.
Analysis Results for “Known” Autosomal Epileptic Encephalopathy Genes
| Sample Size | n-snv | TK(1) | TK(αMAF) | TK(αPC) | TLC(1) | TLC(αMAF) | TLC(αPC) |
|---|---|---|---|---|---|---|---|
| 149 | 109 | 0.32 | 0.34 | 0.34 | 0.29 | 0.39 | 0.38 |

The six rightmost columns give p values for the tests considered: TK, kernel based; TLC, linear combination based; 1, unweighted; αMAF, inversely weighted by minor allele frequency; αPC, population control based.
Given this null result, the question arises of what types of effect the analysis above would have been likely to detect. To address this question for this particular data set, we conducted a power analysis that conditioned on the observed parental data but in which transmission from parent to offspring at a collection of randomly chosen sites was governed by the log relative risk parameter, via the conditional-on-parent-genotype likelihood. Specifically, we randomly selected a proportion of rare variants (MAF ≤ 1% in the general population) as the disease-causal alleles. The odds ratios for these variants were determined by their minor allele frequencies in the exome-sequencing project, i.e., we let OR = 1 + η/(2 MAF), where η is the per-locus population attributable fraction (PAF). Transmission from parent to offspring was random, but we generated the disease status of children by a logistic model with additive effects of all alleles and retrospectively selected the affected children for each family. In this simulation, we assumed that transmissions were independent across variants.
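The simulation step described above can be sketched as follows, under stated assumptions. The function names, the 0/1/2 genotype coding, the `intercept` parameter of the logistic model, and the rejection-sampling loop are illustrative choices, not details taken from the study.

```python
import numpy as np

def log_odds_ratios(maf, eta, causal):
    """beta_j = log(1 + eta / (2 * MAF_j)) at causal sites, 0 elsewhere."""
    maf = np.asarray(maf, float)
    beta = np.zeros_like(maf)
    beta[causal] = np.log1p(eta / (2.0 * maf[causal]))
    return beta

def simulate_affected_child(mother, father, beta, intercept, rng,
                            max_tries=100000):
    """Draw an affected child given parental genotypes coded 0/1/2.

    Sites are transmitted independently. A candidate child is redrawn until
    the additive logistic model labels it affected (retrospective selection).
    `intercept` is a hypothetical baseline log odds of disease.
    """
    for _ in range(max_tries):
        # A parent with genotype g passes on the minor allele with prob g/2.
        child = rng.binomial(1, mother / 2.0) + rng.binomial(1, father / 2.0)
        risk = 1.0 / (1.0 + np.exp(-(intercept + child @ beta)))
        if rng.random() < risk:
            return child
    raise RuntimeError("no affected child drawn; check the intercept")
```

With a realistic (very negative) intercept the rejection loop becomes slow; sampling directly from the conditional-on-parent-genotype likelihood would avoid this, at the cost of a longer example.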
We would expect that the allele frequency at a true causal variant would be elevated among both affected offspring and their parents. Use of the logistic model to simulate disease status among offspring ensures that the causal allele is more common in affected offspring than it is in their parents. To model the fact that causal variants will appear to be enriched in parents relative to population controls, we note that for low-frequency alleles, the allele frequency in affected persons should be elevated by a factor of $e^{\beta}$ compared to that in population controls. Further, the allele frequency in parents of affected offspring should be the average of that in offspring and population controls. Because we are conditioning our simulation of offspring genotypes on the parental genotypes in the EE trio data, this implies that rather than sample population control alleles according to the allele frequencies in the ESP, we should instead sample population control data using the allele frequencies

$$p_{PC} = \frac{2\,p_{ESP}}{1 + e^{\beta}},$$

where $p_{ESP}$ are the original allele frequencies observed in the exome sequencing project data and $p_{PC}$ are the allele frequencies for controls used in our simulation. Note that when β = 0 (i.e., when the variant is not causal), $p_{PC} = p_{ESP}$. We identified which combinations of PAF and causal allele proportions led to 80% power to detect increased transmission. As can be seen from Figure 1, this analysis can exclude genetic architectures composed of a moderate proportion of rare variants of large effect. Further, the power boost obtained by utilizing the population controls is quite apparent. Though this analysis constitutes the best powered interrogation of rare inherited risk factors in EE genes previously identified through de novo mutation studies, larger sample sizes will be needed to fully address whether inherited variants within these genes play an important role in EE etiology.
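The adjustment of control allele frequencies can be illustrated with a short sketch. It rests on an assumption that should be read as such: if the frequency in affected offspring is $e^{\beta}$ times the control frequency, the parental frequency is the average of the two, and the parental frequency is identified with the ESP frequency, then the control frequency is $2p_{ESP}/(1+e^{\beta})$, which reduces to $p_{ESP}$ when β = 0. The function names and the binomial sampling of control allele counts are illustrative.

```python
import numpy as np

def control_frequencies(p_esp, beta):
    """Control allele frequencies implied by p_parent = (p_case + p_control) / 2
    with p_case = exp(beta) * p_control and p_parent taken to be p_esp."""
    p_esp = np.asarray(p_esp, float)
    beta = np.asarray(beta, float)
    return 2.0 * p_esp / (1.0 + np.exp(beta))

def sample_control_counts(p_pc, n_controls, rng):
    """Minor-allele counts among the 2 * n_controls control chromosomes,
    drawn independently per site (an assumption of this sketch)."""
    return rng.binomial(2 * n_controls, p_pc)
```

For example, a variant with β = log 3 and ESP frequency 0.004 would be sampled in controls at frequency 0.002, so that it appears enriched in parents relative to controls.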
Figure 1.

Power Analysis Conditional on the Parental Sequence and ESP Population Controls
The combination of population attributable risk and the causal variant proportion under which the tests achieve 80% power.
Discussion
We have proposed a data-driven approach to forming powerful linear combinations of variants in rare-variant transmission distortion tests. Unlike other data-driven approaches to identifying optimal linear combinations of rare variants, our approach does not require permutation resampling to compute its asymptotic null distribution. Instead, because the rvTDT is conditional on parental genotype, using parental data in deriving powerful linear combinations does not affect the validity or asymptotic null distribution of the rvTDT. By using simulated data, we have shown that our test has good performance regardless of whether all causal loci act in the same direction and outperforms other weighting schemes in common usage.
In this paper, we focus on linear combinations that are constructed marginally, i.e., one locus at a time. This approach was motivated by the fact that the best exome-sequence data for use as population controls report only genotype counts and not individual sequences. We expect this will change over time and that even larger collections of population controls composed of individual sequences will become available. In fact, this is already happening: ESP has recently released individual-level data through dbGaP. With individual sequences, better estimates of the linear combinations used in rvTDT might be obtained by jointly estimating the effect of all variants in the population control/parental comparison. For example, one could use a logistic regression model, perhaps with regularization. We are currently investigating this approach and plan to present our findings in a subsequent manuscript.
We show that the population control weighted rvTDT proposed here maintains its validity in the presence of population stratification. However, systematic differences in ancestry between population controls and parents could cause population-specific alleles to be up-weighted (or down-weighted) inappropriately, leading to a loss of power. Using simulation, we investigated the effect of confounding due to population stratification (in the population-control/parents comparison) on the power of αPC-based tests. We found that the αPC-based tests still perform well. In fact, we found that only the most extreme scenarios would lead the αPC-based test to have lower power than the unweighted test. Even with fairly strong confounding due to population stratification, the αPC-based test still outperformed the unweighted test. Further, in most cases, much of the power lost to biased weights could likely be reclaimed by adjusting for confounding due to population stratification6 in the comparison of population controls with parents. We plan to consider this in a subsequent manuscript. In the absence of such an adjustment approach, in order to maximize power, we recommend that attempts be made to identify a population control sample that is as representative as possible of the population from which the trios are drawn.
We presented two rvTDTs that utilize population control information: TLC(αPC) and TK(αPC). TK uses a SKAT-like approach to accumulating information across a gene or other genetic unit and, as such, we expected TK(αPC) to outperform TLC(αPC) when there was a mix of risk and protective variants within a gene or when there was a substantial proportion of neutral variants. However, we found TLC(αPC) was more powerful than TK(αPC) across all simulations. For this reason, we recommend TLC(αPC) be used in applications.
The power gains obtained by our approach are quite pronounced and, as such, may have a profound impact on genetic discovery. This is crucial, because many current exome-sequencing projects (especially those designed to interrogate de novo mutations) are moderately to severely underpowered to detect inherited variation. Using the methods presented here, it may be possible to turn an underpowered study into one with reasonable power to detect realistic genetic effects.
Acknowledgments
We thank the patients and investigators of Epi4k and the Epilepsy Phenome/Genome project for access to the epileptic encephalopathy data. This work was supported by grants from the National Institute of Neurological Disorders and Stroke (The Epilepsy Phenome/Genome Project NS053998; Epi4K Project 1 – Epileptic Encephalopathies NS077364; Epi4K – Administrative Core NS077274; Epi4K Sequencing, Biostatistics and Bioinformatics Core NS077303; and Epi4K – Phenotyping and Clinical Informatics Core NS077276). M.P.E. is a consultant for Amnion Laboratories and was also supported by a grant from the National Human Genome Research Institute (HG007508). The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
Appendix A: Robustness of Population Control Weighted Tests
We consider both the linear-combination-based and the kernel-based tests. For the linear-combination-based test, we need to show that under the null hypothesis, where
For the population-control-weighted version, the αjs are functions of parental data. To make this explicit, we write αj(M,P). Thus we have
Note that this result holds regardless of the weight function αj(M,P) or whether these weights are “correctly” estimated. The kernel-based test can be written as
where $U$ is the n × k matrix with (i, j) component $u_{ij}$ and $D$ is the k × k diagonal matrix with diagonal elements $\alpha_1, \alpha_2, \ldots, \alpha_k$. Using standard theory of quadratic forms in normal variables, we can show that the asymptotic distribution of this statistic will follow a mixture of $\chi^2$ distributions under the null hypothesis, as long as $D U^{T} 1_n$ has expectation 0 (the k-dimensional zero vector).
However, it is easy to show that for all j with the same conditional expectation argument as above. Note, again, that this result holds regardless of the weight function αj(M,P) or whether these weights are “correctly” estimated.
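The conditional expectation argument can be checked numerically with a small Monte Carlo sketch. Everything specific here is an assumption made for illustration: $u_{ij}$ is taken to be the transmitted minus untransmitted minor-allele count for trio i at site j, genotypes are coded 0/1/2, sites are transmitted independently, and `weight_fn` is an arbitrary hypothetical weight function of the parental genotypes.

```python
import numpy as np

def null_mean_of_weighted_stat(mother, father, weight_fn, n_rep=2000, seed=0):
    """Monte Carlo mean (and its standard error) of sum_j alpha_j sum_i u_ij
    under Mendelian transmission, holding parental genotypes fixed.

    mother, father: (n_trios, n_sites) genotype arrays coded 0/1/2.
    weight_fn:      maps (mother, father) to an (n_sites,) weight vector;
                    it sees parental data only, never offspring data.
    """
    rng = np.random.default_rng(seed)
    alpha = weight_fn(mother, father)
    stats = np.empty(n_rep)
    for r in range(n_rep):
        # Under the null, each parent passes the minor allele with prob g/2.
        child = rng.binomial(1, mother / 2.0) + rng.binomial(1, father / 2.0)
        # Transmitted minus untransmitted count: child - (parents - child).
        u = 2 * child - (mother + father)
        stats[r] = (u @ alpha).sum()
    return stats.mean(), stats.std(ddof=1) / np.sqrt(n_rep)
```

Because the weights are fixed once the parental genotypes are fixed, the simulated mean should be within sampling error of zero whatever `weight_fn` is, which is the point of the argument above.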
Appendix B: Relationship between Odds Ratio and PAF
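The population attributable risk fraction (PAF) is used to describe the proportion of disease that can be attributed to exposure:26,27

$$\mathrm{PAF} = \frac{P_e(\mathrm{RR} - 1)}{1 + P_e(\mathrm{RR} - 1)},$$

where RR is the relative risk of the exposure and $P_e$ is the prevalence of exposure among the controls. Solving the equation for RR:

$$\mathrm{RR} = 1 + \frac{\mathrm{PAF}}{P_e(1 - \mathrm{PAF})}.$$

Assuming Hardy-Weinberg equilibrium, let p represent the frequency of causal alleles. If p is small,

$$P_e = 1 - (1 - p)^2 \approx 2p;$$

similarly, if PAF is small, 1 − PAF ≈ 1. With these two approximations, we have

$$\mathrm{RR} \approx 1 + \frac{\mathrm{PAF}}{2p}.$$

When the disease is rare, the odds ratio approximates the relative risk. In our simulation setting, β = log OR gives

$$\beta = \log\left(1 + \frac{\eta}{2\,\mathrm{MAF}}\right).$$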
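The quality of the two approximations can be checked numerically. The sketch below uses the standard definition PAF = Pe(RR − 1)/(1 + Pe(RR − 1)) and takes exposure to mean carrying at least one causal allele under Hardy-Weinberg equilibrium; the function names are illustrative.

```python
def paf_from_rr(rr, pe):
    """Population attributable fraction for relative risk rr and exposure
    prevalence pe."""
    return pe * (rr - 1.0) / (1.0 + pe * (rr - 1.0))

def rr_exact(paf, p):
    """Relative risk implied by a given PAF, with exposure prevalence
    pe = 1 - (1 - p)^2 (carriers of at least one causal allele)."""
    pe = 1.0 - (1.0 - p) ** 2
    return 1.0 + paf / (pe * (1.0 - paf))

def rr_approx(paf, p):
    """Small-p, small-PAF approximation: RR ~ 1 + PAF / (2p)."""
    return 1.0 + paf / (2.0 * p)
```

For PAF = 0.01 and p = 0.005 the exact and approximate relative risks differ by less than 1%; the gap grows as either p or PAF increases.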
Supplemental Data
Web Resources
The URLs for data presented herein are as follows:
Genetic Analyses in Epileptic Encephalopathies, http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000654.v1.p1
NHLBI Exome Sequencing Project (ESP) Exome Variant Server, http://evs.gs.washington.edu/EVS/
Online Mendelian Inheritance in Man (OMIM), http://www.omim.org/
rvTDT package on the Comprehensive R Archive Network, http://cran.us.r-project.org/web/packages/rvTDT/index.html
References
- 1. Maher B. Personal genomes: The case of the missing heritability. Nature. 2008;456:18–21. doi: 10.1038/456018a.
- 2. Cirulli E.T., Goldstein D.B. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat. Rev. Genet. 2010;11:415–425. doi: 10.1038/nrg2779.
- 3. Gibson G. Rare and common variants: twenty arguments. Nat. Rev. Genet. 2011;13:135–145. doi: 10.1038/nrg3118.
- 4. Bacanu S.A., Devlin B., Roeder K. The power of genomic control. Am. J. Hum. Genet. 2000;66:1933–1944. doi: 10.1086/302929.
- 5. Kang H.M., Zaitlen N.A., Wade C.M., Kirby A., Heckerman D., Daly M.J., Eskin E. Efficient control of population structure in model organism association mapping. Genetics. 2008;178:1709–1723. doi: 10.1534/genetics.107.080101.
- 6. Price A.L., Patterson N.J., Plenge R.M., Weinblatt M.E., Shadick N.A., Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909. doi: 10.1038/ng1847.
- 7. Mathieson I., McVean G. Differential confounding of rare and common variants in spatially structured populations. Nat. Genet. 2012;44:243–246. doi: 10.1038/ng.1074.
- 8. Madsen B.E., Browning S.R. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5:e1000384. doi: 10.1371/journal.pgen.1000384.
- 9. Morgenthaler S., Thilly W.G. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat. Res. 2007;615:28–56. doi: 10.1016/j.mrfmmm.2006.09.003.
- 10. Neale B.M., Rivas M.A., Voight B.F., Altshuler D., Devlin B., Orho-Melander M., Kathiresan S., Purcell S.M., Roeder K., Daly M.J. Testing for an unusual distribution of rare variants. PLoS Genet. 2011;7:e1001322. doi: 10.1371/journal.pgen.1001322.
- 11. Wu M.C., Lee S., Cai T., Li Y., Boehnke M., Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029.
- 12. De G., Yip W.K., Ionita-Laza I., Laird N. Rare variant analysis for family-based design. PLoS ONE. 2013;8:e48495. doi: 10.1371/journal.pone.0048495.
- 13. Ionita-Laza I., Lee S., Makarov V., Buxbaum J.D., Lin X. Family-based association tests for sequence data, and comparisons with population-based association tests. Eur. J. Hum. Genet. 2013;21:1158–1162. doi: 10.1038/ejhg.2012.308.
- 14. Schaid D.J., Sommer S.S. Genotype relative risks: methods for design and analysis of candidate-gene association studies. Am. J. Hum. Genet. 1993;53:1114–1126.
- 15. Self S.G., Longton G., Kopecky K.J., Liang K.Y. On estimating HLA/disease association with application to a study of aplastic anemia. Biometrics. 1991;47:53–61.
- 16. Li B., Leal S.M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 2008;83:311–321. doi: 10.1016/j.ajhg.2008.06.024.
- 17. Price A.L., Kryukov G.V., de Bakker P.I., Purcell S.M., Staples J., Wei L.J., Sunyaev S.R. Pooled association tests for rare variants in exon-resequencing studies. Am. J. Hum. Genet. 2010;86:832–838. doi: 10.1016/j.ajhg.2010.04.005.
- 18. Duchesne P., De Micheaux P.L. Computing the distribution of quadratic forms: Further comparisons between the Liu-Tang-Zhang approximation and exact methods. Comput. Stat. Data Anal. 2010;54:858–862.
- 19. Davies R.B. Algorithm AS 155: The distribution of a linear combination of chi-square random variables. J. R. Stat. Soc. Ser. C Appl. Stat. 1980;29:323–333.
- 20. Lin D.Y., Tang Z.Z. A general framework for detecting disease associations with rare variants in sequencing studies. Am. J. Hum. Genet. 2011;89:354–367. doi: 10.1016/j.ajhg.2011.07.015.
- 21. Guedj M., Nuel G., Prum B. A note on allelic tests in case-control association studies. Ann. Hum. Genet. 2008;72:407–409. doi: 10.1111/j.1469-1809.2008.00438.x.
- 22. Schaffner S.F., Foo C., Gabriel S., Reich D., Daly M.J., Altshuler D. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005;15:1576–1583. doi: 10.1101/gr.3709305.
- 23. Allen A.S., Berkovic S.F., Cossette P., Delanty N., Dlugos D., Eichler E.E., Epstein M.P., Glauser T., Goldstein D.B., Han Y., Epi4K Consortium; Epilepsy Phenome/Genome Project. De novo mutations in epileptic encephalopathies. Nature. 2013;501:217–221. doi: 10.1038/nature12439.
- 24. Poduri A., Lowenstein D. Epilepsy genetics—past, present, and future. Curr. Opin. Genet. Dev. 2011;21:325–332. doi: 10.1016/j.gde.2011.01.005.
- 25. McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M.A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110.
- 26. Choi B.C. Population attributable fraction: comparison of two mathematical procedures to estimate the annual attributable number of deaths. Epidemiol. Perspect. Innov. 2010;7:8. doi: 10.1186/1742-5573-7-8.
- 27. Northridge M.E. Public health methods—attributable risk as a link between causality and public health action. Am. J. Public Health. 1995;85:1202–1204. doi: 10.2105/ajph.85.9.1202.
