Abstract
Sequence-based association studies are at a critical inflexion point with the increasing availability of exome-sequencing data. A popular test of association is the sequence kernel association test (SKAT). Weights are embedded within SKAT to reflect the hypothesized contribution of the variants to the trait variance. Because the true weights are generally unknown, and so are subject to misspecification, we examined the efficiency of a data-driven weighting scheme.
We propose the use of a set of theoretically defensible weighting schemes, of which, we assume, the one that gives the largest test statistic is likely to capture best the allele frequency-functional effect relationship. We show that the use of alternative weights obviates the need to impose arbitrary frequency thresholds in sequence data association analyses. As both the score test and the likelihood ratio test (LRT) may be used in this context, and may differ in power, we characterize the behavior of both tests.
We found that the two tests have equal power if the set of weights resembled the correct ones. However, if the weights are badly specified, the LRT shows superior power (due to its robustness to misspecification). With this data-driven weighting procedure the LRT detected significant signal in genes located in regions already confirmed as associated with schizophrenia – the PRRC2A (P=1.020E-06) and the VARS2 (P=2.383E-06) – in the Swedish schizophrenia case-control cohort of 11,040 individuals with exome-sequencing data.
The score test is currently preferred for its computational efficiency and power. Indeed, assuming correct specification, in some circumstances the score test is the most powerful. However, LRT has the advantageous properties of being generally more robust and more powerful under weight misspecification. This is an important result given that, arguably, misspecified models are likely to be the rule rather than the exception in weighting-based approaches.
Introduction
With the increasing availability of exome/genome-sequencing data, rare variant association studies are gaining importance in human genetic research. One important test of association between a target set of rare variants (RVs) and a given phenotype is the sequence kernel-based association test (SKAT) [10,23,27,30,31,47,50]. SKAT is based on a random effects model, in which the effect sizes of the RVs are assumed to be drawn from a distribution with zero mean and a common variance. That the effect sizes are characterized by a single variance is a strong assumption, which is made plausible by weighting of effect sizes. The required weights are typically assigned based on meta-information about the tested variants, such as allele frequency and functional predictions [26,33,40,50], with rarer and functional variants expected to have larger effects. Allele frequency, in particular, is an important weighting factor, as the rarer the variant is, the stronger the average purifying selection coefficient [41,45]. Accordingly, the effect sizes for rare variants will tend to be larger than for more common variants.
The relationship between effect size, frequency and selection, however, rests on directional assumptions about the extent of selection on the phenotype in question and the demographic history of the population [18,40,55]. Specifically, several conditions have to hold for the frequency to be genuinely informative about the functional effect that a genetic variant has on the trait, namely: (a) the population under study has not experienced recent severe bottlenecks; (b) the selection on the trait of interest is direct; (c) the selection is strong (i.e., selection coefficient $s \geq 10^{-2.5}$); and (d) the selection acts uniformly across the associated genes. Yet, for the reasons detailed below, the circumstances in which these conditions are expected to hold are rather special. First, population genetics theory predicts that the frequency of deleterious variants will vary with the size of the effect the associated trait has on fitness. For instance, risk variants implicated in early-onset diseases (e.g., autism) will be mostly rare, i.e., kept at low frequencies by selection pressures because of the high impact these diseases have on reproductive fitness (Manolio, Collins et al. [34]). In contrast, variants associated with a trait having a negligible effect on fitness (e.g., Alzheimer disease) will likely escape selection and so may occur at relatively high frequencies in the population (Zuk, Schaffner et al. [55]). Second, even if the trait of interest is under strong selection pressure, variants across the whole frequency spectrum may jointly contribute to disease risk, as simulation studies (Price, Kryukov et al. [40]) and empirical results (e.g., Cohen, Boerwinkle et al. [11], Teslovich, Musunuru et al. [48]) have demonstrated. Third, the allele frequency distribution is expected to vary as a function of the demographic history of the population. Using population genetics simulations, Zuk et al. [55] showed that, given the same selection coefficient s, the frequency of deleterious alleles influencing a trait will depend on the mutation rate and on whether the population under study has encountered recent severe bottlenecks. For example, given strong selection pressures (i.e., $s \geq 10^{-2.5}$) acting directly on the phenotype, the median frequency of the associated alleles may vary from as high as 0.0377 in recently bottlenecked populations (e.g., Finland) to as low as 9.36E-05 in a large population with simple exponential expansion. Finally, the strength of selection is expected to vary across genes, and so will the allele frequency-functional effect relationship (Price, Kryukov et al. [40], Zuk, Schaffner et al. [55]). Genes under weak selection will harbor both common and rare variants, both with functional effects, whereas functional variants within genes under strong selective constraints will mainly be rare. The examples above indicate that testing genomic regions by relying on a weighting scheme which up-weights rarer variants and puts low or zero weights on the more common ones is optimal only in specific circumstances.
Because the true weights are generally unknown, and therefore subject to misspecification, we examined the efficiency of a data-driven weighting scheme. We propose the use of a set of theoretically defensible weighting schemes, of which, we assume, the one that gives the largest test statistic is likely to capture best the allele frequency-functional effect relationship. The set of alternative weighting schemes will accommodate genomic regions where only very rare variants are likely to be functional, as well as regions under weak selection pressures, harboring both rare and common variants, both (possibly) related to the risk of the disease of interest. As such, this adaptive weighting procedure renders the (arbitrary) MAF thresholding unnecessary. Family-wise error rate can be protected by using a multiple testing correction method (e.g., the Bonferroni method) or by using permutation. Using simulations we demonstrate that the use of alternative (incorrect) weights does not inflate the type I error rate. We show the power benefits conferred by the use of such a data-driven weighting procedure in both simulated and empirical data. As both the score test [50] and the likelihood ratio test [31] may be used in this context, and may differ in power [53], we characterize the behavior of both tests.
Below we first formulate the model and briefly describe the likelihood ratio test and the score test. We then present and evaluate the use of a data-driven weighting scheme in simulated and empirical data. Specifically, we evaluate the efficiency of the two tests under (a) the data-driven weighting scheme, relative to their efficiency under (b) incorrect, and (c) correct weighting. Finally, we discuss the robustness of the two tests to misspecification, and the power advantages conferred by our proposed weighting procedure in SKAT.
Material and Methods
Model formulation
Let y be the n-dimensional vector of continuous phenotypic scores obtained in a sample of n individuals. Let X be the n × p design matrix containing covariates. Let G be the n × m matrix of genotype values, with the gij element denoting the genotype value of the individual i (i = 1 … n) at locus j (j = 1 … m). Genotypes are coded as additive-codominant, i.e., gij = (0, 1, 2). The association between the phenotype and the set of m variants is modeled within the linear mixed model framework as:
$$y = X\beta + Gb + e \qquad (1)$$
with $\beta^t = (\beta_1, \ldots, \beta_p)$ being the p-dimensional vector of fixed effects of covariates, $b^t = (b_1, \ldots, b_m)$ being the m × 1 vector of regression coefficients in the regression of the phenotype on the m genetic variants within the target set, and e being the n-dimensional vector of random residuals. The random vectors b and e are assumed to be normally distributed: $b \sim N(0, \sigma_g^2 W)$ and $e \sim N(0, \sigma_e^2 I)$, with W being the m × m diagonal matrix of variant weights and I being the identity matrix of appropriate dimension.
The diagonal matrix W contains the weights used to weight the contribution to the test statistic of the variants in the set. The normally distributed phenotype y has expected mean $E[y] = X\beta$ and variance-covariance matrix:
$$\mathrm{Var}(y) = \sigma_g^2\, G W G^t + \sigma_e^2 I \qquad (2)$$
with $GWG^t$ being the weighted kernel or genetic relationship matrix. As implemented in SKAT [50], each diagonal element $w_j$ of the matrix $W = \mathrm{diag}(w_1, \ldots, w_m)$ is related to the minor allele frequency of the j-th variant by means of the beta density function (dbeta), which is characterized by two shape parameters. The specification of the two shape parameters is informed by the hypothesized relationship between the j-th variant effect and its minor allele frequency (MAF; see section on ‘Weighting’ below).
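As a minimal illustration of the weighting step (a sketch, not the SKAT or FaST-LMM-Set implementation), the beta-density weights and the weighted kernel can be computed as follows. The function names `dbeta` and `weighted_kernel`, the toy genotype matrix and the MAF values are hypothetical and serve only to show the arithmetic.

```python
import math

def dbeta(x, a1, a2):
    """Beta density at x (0 < x < 1) with shape parameters a1 and a2."""
    log_norm = math.lgamma(a1 + a2) - math.lgamma(a1) - math.lgamma(a2)
    return math.exp(log_norm + (a1 - 1) * math.log(x) + (a2 - 1) * math.log(1 - x))

def weighted_kernel(G, w):
    """Return K = G W G^t for genotypes G (list of rows) and diagonal weights w."""
    n = len(G)
    return [[sum(w[j] * G[i][j] * G[k][j] for j in range(len(w)))
             for k in range(n)] for i in range(n)]

# Illustrative values only (assumptions, not data from the study)
mafs = [0.005, 0.01, 0.05]               # minor allele frequencies of 3 variants
w = [dbeta(f, 1, 25) for f in mafs]      # dbeta(1,25): weights decrease with MAF
G = [[0, 1, 0], [1, 0, 2], [0, 0, 1]]    # 3 individuals x 3 variants, coded 0/1/2
K = weighted_kernel(G, w)                # 3 x 3 weighted kernel matrix
```

Under dbeta(1,1) every weight equals 1, so the kernel reduces to the unweighted product of the raw genotype matrix with its transpose.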
Tests of variance components
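To test whether the parameter of interest, the genetic variance component $\sigma_g^2$, deviates significantly from zero, one can employ a likelihood ratio test (LRT) or a score test. The likelihood ratio test statistic is computed as two times the difference between the maximized log-likelihoods of the alternative model ($\sigma_g^2$ estimated freely) and the null model ($\sigma_g^2$ constrained to equal 0). Parameter estimation can be performed by restricted/residual maximum likelihood [9]:
$$\ell_R(\sigma_g^2, \sigma_e^2) = -\frac{1}{2}\left[\log|V| + \log|X^t V^{-1} X| + (y - X\hat{\beta})^t V^{-1} (y - X\hat{\beta})\right] + c \qquad (3)$$
where $V = \sigma_g^2\, GWG^t + \sigma_e^2 I$, $\hat{\beta} = (X^t V^{-1} X)^{-1} X^t V^{-1} y$, and c is a constant that does not depend on the variance components.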
In evaluating the statistical significance of the restricted LRT, we note that the null distribution of the test statistic is a mixture of $\chi^2$ distributions, $\pi\chi^2_0 + (1-\pi)\,a\chi^2_d$, with the mixture parameter π, the scale parameter a, and the degrees of freedom d of the second component estimated using the computationally efficient permutation-based approach developed by Listgarten et al. [31].
The score test is computed as:
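$$Q = (y - X\hat{\beta}_0)^t\, GWG^t\, (y - X\hat{\beta}_0) \qquad (4)$$
where $\hat{\beta}_0$ denotes the fixed effects estimated under the null model. The null distribution of Q is a mixture of chi-square distributions, and statistical significance is assessed by means of the Davies exact method [15].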
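To make the quadratic form of the score statistic concrete, the sketch below evaluates $r^t GWG^t r$ as a weighted sum of squared per-variant scores. It assumes, for simplicity, a null model with an intercept only, so that the residuals r are the mean-centered phenotypes; the function name `score_statistic` and the toy data are hypothetical. Computing the p-value (e.g., with the Davies method) is not shown.

```python
def score_statistic(y, G, w):
    """Q = r^t G W G^t r, with r the residuals of an intercept-only null model."""
    mean_y = sum(y) / len(y)
    r = [yi - mean_y for yi in y]
    # G^t r: one score per variant (genotype column times residual vector)
    s = [sum(G[i][j] * r[i] for i in range(len(y))) for j in range(len(w))]
    # W is diagonal, so the quadratic form is a weighted sum of squared scores
    return sum(wj * sj * sj for wj, sj in zip(w, s))

# Toy data (assumptions for illustration): 4 individuals, 2 variants
y = [1.0, 2.0, 3.0, 6.0]
G = [[0, 1], [1, 0], [0, 0], [2, 1]]
q_flat = score_statistic(y, G, [1.0, 1.0])      # equal weights
q_weighted = score_statistic(y, G, [2.0, 0.5])  # first variant up-weighted
```

Because W is diagonal, the n × n kernel never has to be formed explicitly, which is one reason the score statistic is cheap to compute.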
Data simulation
Phenotypes and genotypes in Hardy-Weinberg equilibrium were generated in samples of n = 10,000 unrelated individuals. Specifically, we simulated two m-dimensional random vectors of continuous variables representing alleles at m equidistant loci for each individual i from the sample. The vectors were drawn from a multivariate normal distribution with zero mean and correlation matrix ΣLD. We set ΣLD to equal an identity matrix (as we considered sets of rare variants, expected to be in linkage equilibrium, see e.g., [16]; but see Supplementary material for results based on rare, and rare and common, variants in linkage disequilibrium simulated using a coalescent model [2]). The multivariate normally distributed variables were then discretized given chosen thresholds based on the MAF at each locus. We considered MAFs varying randomly between 0.005 and 0.05, sampled from a uniform distribution. Given the vectors of alleles, we then created the m vectors of genotypes, gij. Based on the genotypes, the n × 1 vector of phenotypes, y, was generated as:
$$y_i = \sum_{j=1}^{m} g_{ij} b_j + e_i \qquad (5)$$
where $b_j$, the regression weight of the variant at the j-th locus, was computed as a function of $\mathrm{MAF}_j$ and of its contribution to the standardized variance of the polygenic scores [35]. Namely, the regression weights varied with MAF, while their contribution to the genetic variance was equal. Simulating data in this fashion is equivalent to simulation according to dbeta(MAF, .5,.5) weights [50], with weights increasing with decreasing MAF. The genetic variance $\sigma_g^2$ equaled 0.01 across all scenarios we considered, and the residual variance $\sigma_e^2$ equaled 1: the n-dimensional vector of environmental scores e was drawn from a standard normal distribution N(0, 1).
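The simulation design described above can be sketched as follows for the linkage-equilibrium case. This is an illustrative reimplementation, not the R code used in the study: the function name `simulate` is hypothetical, the effect-size formula is one way (an assumption) of giving every variant an equal share of the genetic variance, and the random sign reflects target regions harboring both deleterious and beneficial variants.

```python
import math
import random
from statistics import NormalDist

def simulate(n, m, sigma2_g=0.01, seed=1):
    """Simulate genotypes in Hardy-Weinberg equilibrium and a continuous phenotype."""
    rng = random.Random(seed)
    mafs = [rng.uniform(0.005, 0.05) for _ in range(m)]
    # Threshold t_j on the standard normal scale such that P(z < t_j) = MAF_j
    thresholds = [NormalDist().inv_cdf(f) for f in mafs]
    # Assumed effect sizes: each variant explains sigma2_g / m of the variance,
    # so |b_j| increases as MAF_j decreases; the sign is drawn at random
    b = [rng.choice((-1, 1)) * math.sqrt(sigma2_g / (m * 2 * f * (1 - f)))
         for f in mafs]
    G, y = [], []
    for _ in range(n):
        # Two independent alleles per locus, each a discretized normal draw
        row = [(rng.gauss(0, 1) < t) + (rng.gauss(0, 1) < t) for t in thresholds]
        G.append(row)
        # Phenotype: sum of genotype effects plus a standard normal residual
        y.append(sum(g * bj for g, bj in zip(row, b)) + rng.gauss(0, 1))
    return mafs, b, G, y

mafs, b, G, y = simulate(n=200, m=5)  # small sizes chosen only for illustration
```

With independent loci the expected genetic variance is the sum over variants of $2\,\mathrm{MAF}_j(1-\mathrm{MAF}_j)\,b_j^2$, which equals `sigma2_g` by construction, and $|b_j|$ is proportional to the dbeta(.5,.5) density evaluated at $\mathrm{MAF}_j$.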
Data-Driven search for optimal weights: exploring the misspecification space
Because the strength and effectiveness of selection pressures vary across the genome, committing to a single weighting scheme when testing thousands of genes may only capture signal from genes under selection pressures matching the chosen weighting scheme. An optimal weighting scheme should be allowed to vary across the tested genes, to match variable selection pressures. To this end, we evaluated the efficiency of a data-driven search for optimal weights. We carried out simulations to evaluate the efficiency of the LRT and the score test under (a) the variable data-driven weighting scheme, relative to their efficiency under (b) incorrect, and (c) correct weighting.
The m-dimensional vector of weights w was computed using the beta density function, with the j-th element calculated as wj = dbeta(MAFj; a1, a2) given the MAF of the j-th variant and the shape parameters a1 and a2. As described in the previous section, data were simulated according to dbeta(.5,.5) weights (i.e., the true weights increase with decreasing MAF). Next, in computing the test statistic we (mis)specified the weights as: a) dbeta(1,1), b) dbeta(.5,.5), c) dbeta(1,25), and d) dbeta(1,50). The first weighting scheme pertains to the hypothesis that there is no relationship between the regression weight and the frequency of the variant (hence, the more common variants contribute on average more to variation in the phenotype). In this scenario the association test is carried out with raw additive-codominant coding of the genotypes. The use of the second weighting scheme is equivalent to standardization of the genotypic values prior to the analysis. We considered the effect of this weighting scheme as this treatment of the genotypes is the default in GCTA [52] and in FaST-LMM-Set [31]. Standardization and assignment of weights dbeta(.5,.5) are equivalent weighting schemes [50] in which the contribution to the test of rarer variants is up-weighted relative to that of the more common ones [46], and hence the variants contribute on average equally to the variance in the phenotype, regardless of frequency. We also considered the effects of the third weighting scheme (dbeta(1,25)) as these are the default weights in SKAT [50]. Finally, we considered the effect of a more extreme weighting scheme (dbeta(1,50)), including weights that overlook common variants and favor the contribution to the test statistic of rarer ones. This weighting scheme pertains to the hypothesis that only ultra-rare variants contribute to the phenotypic variance.
We performed association tests by using the set of 3 incorrect weighting schemes, i.e., a) dbeta(1,1); b) dbeta(1,25), and c) dbeta(1,50). The p-value for the gene equaled the minimum Bonferroni corrected p-value minPLRT (minPscore) out of the 3 p-values obtained given the genotypes transformed according to each of the weighting schemes enumerated above. We also report the power of the tests under each of these misspecified weighting schemes, as it is of interest to assess whether our procedure confers power gains relative to a test which uses a single (misspecified) weighting scheme (i.e., 3 tests vs. 1 test). We assessed the behavior of the two tests under the above weighting schemes by considering target regions harboring both deleterious and beneficial variants.
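The selection step of the data-driven procedure amounts to a Bonferroni-corrected minimum p-value across the candidate weighting schemes. A minimal sketch is given below; the function name `min_p_bonferroni` and the p-values are hypothetical and chosen only for illustration.

```python
def min_p_bonferroni(p_values):
    """Smallest p-value across weighting schemes, corrected for their number."""
    k = len(p_values)
    return min(1.0, k * min(p_values))

# Made-up p-values for one gene under the three alternative schemes
p_by_scheme = {"dbeta(1,1)": 0.004, "dbeta(1,25)": 0.0005, "dbeta(1,50)": 0.02}
best_scheme = min(p_by_scheme, key=p_by_scheme.get)     # scheme with strongest signal
gene_p = min_p_bonferroni(list(p_by_scheme.values()))   # 3 * 0.0005, capped at 1
```

The correction factor equals the number of schemes tried, which is the price in power referred to when more weighting schemes are added to the set.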
Evaluating the type I error rates and power
We evaluated the type I error rate by generating 1000 datasets under the null hypothesis of no phenotypic variance explained by the variants within the target set. The type I error rate was computed as the proportion of datasets in which the tests incorrectly rejected the null hypothesis and was evaluated given α = 0.01. We refer to Listgarten et al. [31] for an exhaustive evaluation of the type I error rate at more stringent α levels.
Power was assessed based on 1000 simulated datasets, an effect size of 1% explained phenotypic variance and 7 alpha thresholds. Given the 7 alpha thresholds, power equaled the proportion of datasets in which the effect was detected. The p-value was computed using the permutation-based procedure implemented in FaST-LMM-Set [31]. Estimation of the free parameters π, a, and d of the null distribution $\pi\chi^2_0 + (1-\pi)\,a\chi^2_d$ used 1000 permutations. As a validity check of our simulation program, we also report the power and the type I error rates of the true (i.e., correct) model.
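Both the type I error rate and power are estimated as the proportion of simulated datasets in which the test rejects the null hypothesis at a given α. The sketch below shows this computation; the function name `rejection_rate` and the p-values are hypothetical, and the normal-approximation interval is an assumption made for illustration, not necessarily the interval method used for the tabulated results.

```python
import math

def rejection_rate(p_values, alpha):
    """Proportion of datasets with p < alpha and an approximate 95% interval."""
    n = len(p_values)
    rate = sum(p < alpha for p in p_values) / n
    half = 1.96 * math.sqrt(rate * (1 - rate) / n)  # normal approximation
    return rate, max(0.0, rate - half), min(1.0, rate + half)

# Made-up p-values from 1000 null datasets: 10 fall below alpha = 0.01
p_null = [0.005] * 10 + [0.5] * 990
rate, lower, upper = rejection_rate(p_null, alpha=0.01)
```

Applied to datasets simulated under the null hypothesis the function estimates the type I error rate; applied to datasets simulated under the alternative it estimates power.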
Software
The R package MASS [49] was used for data generation. Model fitting was performed in FaST-LMM-Set [31]. The software is readily available for use on GitHub. For the sake of comparison, we analyzed one simulated sample of 5000 individuals by using 4 independent programs implementing genetic similarity/kernel-based variance component tests: the R package nlme, the software Genome-wide Complex Trait Analysis (GCTA; [52]), the software FaST-LMM-Set [31] and the R package OpenMx [6]. The values for the LRT and the estimates for the variance component obtained by the 4 programs were almost identical (see S1 Table Supplementary Material for details), indicating that these implement equivalent approaches. Having established the equivalence, the empirical analyses were conducted using the FaST-LMM-Set program. Analyses were carried out on the Broad Institute Gold Compute cluster and on the Lisa cluster (https://www.surf.nl/en).
Empirical analysis: evaluating the importance of thresholding and variable weighting
We compared the performance of the likelihood ratio test and of the score test under our proposed data-driven weighting scheme in a real dataset. For this illustration we used the Swedish schizophrenia case-control cohort of 11,040 individuals with exome-sequencing data from blood DNA. Cases had a clinical diagnosis of schizophrenia and at least two hospitalizations as determined by expert review based on the Hospital Discharge Register [13,25]. Controls, without a diagnosis of schizophrenia or bipolar disorder, were randomly selected from population registries. Both cases and controls are of Scandinavian ancestry, aged 18 or older (see [42,43] for a detailed description of the sample). There were 175 individuals with unreliable samples (i.e., duplicates, ethnic outliers or having a genotype missing rate higher than 10%) whom we removed from the analysis. This left 4867 cases and 6173 controls for the analysis, of whom 6052 were males. Written informed consent was obtained from all participants (or legal guardian consent and subject assent). All procedures were approved by the ethical committees in Sweden and in the United States. Data are available through dbGaP.
Exome-sequencing was performed in twelve waves at the Broad Institute of MIT and Harvard. For samples in the first wave, hybrid capture was performed using the Agilent SureSelect Human All Exon Kit method. In this version, the method targets ~28 million base-pairs partitioned in ~160,000 regions. Sequencing was done using Illumina GAII instruments. For samples in the waves two to twelve, hybrid capture was done by using the newer version of the Agilent SureSelect Human All Exon v.2 Kit method, which targets ~32 million base-pairs partitioned in ~190,000 regions. Sequencing was performed using the Illumina HiSeq 2000 and HiSeq 2500 instruments. We used BWA ALN version 0.5.9 [29] to align the reads to the GRCh37 human genome reference and we applied Picard/GATK to process the sequence data and to call variants [36]. Selected singletons were validated using Sanger sequencing (see [42] for details). Variants out of Hardy-Weinberg equilibrium (P-value < 5E-8) and showing excess heterozygosity, or variants showing excessive correlation (P-value < 5E-8) with the covariates (that could not be explained by principal components) were excluded from the analysis. In addition, we excluded variants that did not pass the GATK default filters [8,17]. There were 1,584,195 variants meeting all our quality control criteria.
For this empirical illustration we focused on two partially overlapping sets of genes (1435 genes) likely relevant to schizophrenia. The first set consisted of 941 genes which are part of the list identified by Samocha et al. [44] as highly constrained. These constrained genes were proposed as candidates in autism spectrum disorder (ASD) given their enrichment for de novo loss of function case mutations. Given evidence favouring the hypothesis that schizophrenia and ASD share genetic aetiology [12,20], this set of genes is likely to be relevant also to schizophrenia. The second set consisted of 768 genes targeted by the Fragile-X mental retardation protein (FMRP). This set is part of the list of genes derived by Darnell et al. [14] from mouse brain as likely implicated in regulating synaptic plasticity. Genes targeted by FMRP were found to be enriched for de novo nonsynonymous case mutations in both ASD [24] and schizophrenia [20]. Purcell et al. [42] also tested the FMRP set for enrichment of rare variants in half of the current sample, and their analysis yielded nominally significant results.
We performed sequence-based kernel association analyses using the likelihood ratio and score tests with variable weights. The analysis was carried out using the software FaST-LMM-Set [31]. To adjust for ancestry we included in the analysis the two principal components explaining the largest amount of variance in the sample and reflecting the Finnish and Northern/Southern Swedish ancestry (see Extended Data Figure 1 in [42], see also [1]). Principal components were computed from genotypes at variants shared with the 1000 Genomes Project phase 1 dataset. To accommodate the scenario in which only rare variants are likely to be functional, as well as the scenario in which the targeted region is under weak selection pressures, harboring both rare and more common variants, both (possibly) related to the risk of disease (regardless of frequency), we used three alternative weighting schemes: dbeta(1,25), dbeta(.5,.5) and dbeta(1,1). The use of alternative weights obviates the need for choosing arbitrary frequency thresholds to select the target set. However, for the sake of illustration, we also report the results obtained in the analyses stratified based on allele count thresholds (i.e., we selected variants with a minor allele count (MAC) up to 10 and a MAC up to 50).
For each of the tested genes, we selected the Bonferroni corrected p-value corresponding to the weighting scheme that yields the largest test statistic (i.e., the p-value was adjusted for multiple hypothesis testing of 1435 genes and 3 weighting schemes). An alpha of 0.05 was used as the significance threshold. For computational ease we used a linear model [31]. The linear LRT (and the linear score test) shows good control of the type I error rate and has performed as well as a generalized linear model in case-control samples (see [30]).
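The multiple-testing burden of this empirical analysis follows directly from the number of genes and weighting schemes. The short sketch below works out the implied per-test threshold; the variable and function names are hypothetical, and the example p-value is made up.

```python
n_genes, n_schemes, alpha = 1435, 3, 0.05

n_tests = n_genes * n_schemes       # 4305 gene-by-scheme tests in total
per_test_alpha = alpha / n_tests    # uncorrected p-value needed for significance

def bonferroni(p, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p * n_tests)

corrected = bonferroni(1e-6, n_tests)   # made-up uncorrected p-value of 1E-06
is_significant = corrected < alpha
```

An uncorrected p-value must therefore fall below roughly 1.16E-05 for a gene to be declared significant at the chosen alpha of 0.05.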
Results
Type I error
Table 1 contains the results pertaining to the type I error rates of the two tests, given correct and incorrect model specification. Both the restricted likelihood ratio test and the score test yield correct type I error rates, regardless of whether the weights used are correctly specified or misspecified. The two tests show good control of the type I error rate also under our proposed Bonferroni data-driven weighting procedure. Note that these conclusions generalize to scenarios in which the target set includes common and rare variants in linkage equilibrium/disequilibrium (see Supplementary Table S2).
Table 1.
weights dbeta | LRT | Score test |
---|---|---|
(.5,.5) | [0.0043, 0.0176] | [0.0030, 0.0150] |
(1,1) | [0.0054, 0.0183] | [0.0037, 0.0163] |
(1,25) | [0.0043, 0.0176] | [0.0030, 0.0150] |
(1,50) | [0.0054, 0.0183] | [0.0037, 0.0163] |
Bonferroni | [0.0018, 0.0123] | [0.0024, 0.0137] |
Power
Figure 1 displays the results relating to power to detect a target set of 50 functional variants. Four important conclusions follow from our simulation results. First, the restricted LRT and the score test have equal power when the weights are correctly specified. This is expected, as the two tests are asymptotically equivalent when the model is true, i.e., correctly specified (e.g., [21]). The powers of the two tests (displayed in red in Figure 1) are similar when the assigned weights correspond to the true weights.
Second, misspecification of weights always reduces power. This is shown in Figure 1 as the departure of the power under model misspecification (the colored lines) from the power of the true model (the red lines). The exact loss in power depends on the degree of weight misspecification and on the statistical test employed. We note that the power loss is relatively small given mild misspecification of weights (e.g., when the assigned weights dbeta(1,25) resemble the true weights dbeta(.5,.5), as illustrated by the blue lines in Figure 1). However, the power may suffer dramatically with increasing misspecification. For instance, using a dbeta(1,50) weighting scheme, which acts as a frequency threshold by removing the more common variants from the test, results in a loss in power of up to ~10% and ~34% (given an alpha of $10^{-7}$) for the restricted LRT and for the score test, respectively.
Third, relative to the score test, we note that the restricted LRT is consistently more robust to weight misspecification. These results are consistent with those reported by Zeng et al. [53] and by Lippert et al. [30], who found the LRT to be generally more powerful than the score test across their simulated settings. Although Lippert et al. did not consider the behavior of the two tests under misspecified weights, they reported the same pattern of results in real data analysis, where the LRT yielded consistently more associations than the score test. As the real weights are in all likelihood not known, the superior power of the restricted LRT in real data might also be explained by its robustness to weight misspecification and to the inclusion of weighted neutral variation in the computation of the test statistic.
Fourth, we note that both tests benefit from the use of variable weights. The data-driven search for optimal weights confers power advantages over a model that uses misspecified weights, and maintains the power close to that afforded by a correctly specified model. It should be noted, however, that there is a price to pay in terms of power by using this data-driven weighting scheme in contrast to correct weighting (i.e., using alternative weights increases the burden of multiple testing). The two tests have equal power with the Bonferroni corrected data-driven weighting procedure; this is due to the fact that weights resembling the correct ones were included in the procedure (the more weights one tries, the larger the price in terms of power one has to pay). Had the procedure included weights misspecified to a greater extent, the power of the score test would have decreased relative to that of the LRT (which appears to be more robust to misspecification). As the true weights are typically unknown, conjecturing the correct ones by employing the proposed Bonferroni scheme with alternative weights and using the likelihood ratio test appears to be the strategy most likely to maintain the power close to that of the true model.
Empirical analysis: evaluating the importance of thresholding and variable weighting
We next looked at the behavior of the score test and of the likelihood ratio test [31] under variable weights in the empirical dataset. Tables 2 and 3 display results pertaining to the association tests in the analyses stratified based on arbitrary minor allele count (MAC) thresholds.
Table 2.
| Chromosome (position range) | Gene (autosome variants) | Weights dbeta | LRT P (Bonferroni corrected P) | Score test P (Bonferroni corrected P) |
|---|---|---|---|---|
| 9 (135762714–135804294) | TSC1 (142) | (1,1) | 0.0001 (0.5596) | 0.0027 (1) |
| | | (.5,.5) | 0.0064 (1) | 0.0256 (1) |
| | | (1,25) | 0.0014 (1) | 0.0026 (1) |
| 15 (52058632–52100672) | TMOD2 (52) | (1,1) | 0.0004 (1) | 0.0039 (1) |
| | | (.5,.5) | 0.0069 (1) | 0.015 (1) |
| | | (1,25) | 0.0034 (1) | 0.0038 (1) |
| 4 (62363001–62935992) | LPHN3 (131) | (1,1) | 0.0006 (1) | 0.0029 (1) |
| | | (.5,.5) | 0.0146 (1) | 0.0563 (1) |
| | | (1,25) | 0.0041 (1) | 0.0029 (1) |
Table 3.
| Chromosome (position range) | Gene (autosome variants) | Weights dbeta | LRT P (Bonferroni corrected P) | Score test P (Bonferroni corrected P) |
|---|---|---|---|---|
| 9 (109685651–109773313) | ZNF462 (224) | (1,1) | 0.0001 (0.4735) | 0.0032 (1) |
| | | (.5,.5) | 0.0078 (1) | 0.0547 (1) |
| | | (1,25) | 0.0001 (0.4735) | 0.003 (1) |
| 15 (52058615–52100672) | TMOD2 (54) | (1,1) | 0.0002 (1) | 0.0091 (1) |
| | | (.5,.5) | 0.0054 (1) | 0.0063 (1) |
| | | (1,25) | 0.0002 (1) | 0.0083 (1) |
| 8 (141669548–141900779) | PTK2 (139) | (1,1) | 0.0008 (1) | 0.0034 (1) |
| | | (.5,.5) | 0.0136 (1) | 0.0429 (1) |
| | | (1,25) | 0.0008 (1) | 0.0033 (1) |
From Table 2 we note that the likelihood ratio test appears to be more powerful than the score test. The two tests seem to agree in selecting the top association signals, as both ranked the same genes in the top three. All three weighting schemes tend to pick up nominally significant association signals. Of these, the dbeta(1,1) weighting scheme yields the lowest P-value for all three genes. Similar trends in the results were observed when we restricted the analyses to variants with a MAC below 50 (see Table 3).
The use of alternative weights obviates the need for thresholding to prioritize the contribution of the variants to the test statistic (thresholds are, moreover, arbitrary: variants defined as rare in one sample might feature as common in another sample). We conducted the analysis using our proposed data-driven weighting scheme, without imposing any frequency threshold. Table 4 contains the results.
Table 4.
| Chromosome (position range) | Gene (autosome variants) | Weights dbeta | LRT P (Bonferroni corrected P) | Score test P (Bonferroni corrected P) |
|---|---|---|---|---|
| 6 (31584304–31607461) | PRRC2A (408) | (1,1) | 1.020E-06 (0.0043) | 2.556E-06 (0.011) |
| | | (.5,.5) | 5.8E-04 (1) | 9.886E-05 (0.4255) |
| | | (1,25) | 0.055 (1) | 0.057 (1) |
| 6 (30877202–30894026) | VARS2 (238) | (1,1) | 2.383E-06 (0.0102) | 0.0043 (1) |
| | | (.5,.5) | 0.0031 (1) | 0.0048 (1) |
| | | (1,25) | 1 (1) | 0.534 (1) |
| 1 (243668558–244006487) | AKT3 (43) | (1,1) | 2.825E-05 (0.1216) | 7E-04 (0.7533) |
| | | (.5,.5) | 0.0036 (1) | 0.0063 (1) |
| | | (1,25) | 1.6E-04 (0.6888) | 7.586E-05 (0.3265) |
For the top three genes, Table 4 shows that the dbeta(1,1) weighting scheme appears to best capture the allele frequency-functional effect relationship. This weighting scheme yields the largest test statistic and singles out PRRC2A and VARS2 as significantly associated with schizophrenia disease status given our chosen alpha threshold (i.e., P-value = 1.020E-06 and P-value = 2.383E-06, respectively). The third top gene is AKT3 (P-value = 2.825E-05). All three genes belong to the Samocha et al. (2014) list of genes under selective constraint. Had one relied on a weighting scheme that up-weights rarer variants and down-weights the more common ones, these association signals would have been missed. As these genes did not pass the significance threshold in the analyses stratified by MAC, the results suggest that arbitrary thresholding might remove causal variants from the target set and, in doing so, might weaken the association signal. We observed similar trends in power when we simulated sets of common and rare functional variants, where, similar to a frequency threshold, the dbeta(1,25) weighting scheme discarded causal variants from the target set (see Supplementary Figure S2). Importantly, association signals in all three genes have been previously reported (e.g., Ripke et al., [4]) and replicated (e.g., Aberg et al., [5]), suggesting that these results are unlikely to be false positives.
Without thresholding, common variants might also be included in the analysis. In our sample, of the 43 (AKT3), 238 (PRRC2A) and 408 (VARS2) tested variants, 1, 29 and 15 variants, respectively, had a MAC greater than 50. The question remains whether the test was dominated by these common variants. We therefore checked whether, in our sample, the common variants yield genome-wide significant association signals when tested with a univariate test. The results showed that none of them would be detected in an ordinary GWAS (see Supplementary Tables S3–S5). Hence either thresholding or relying on a default weighting scheme would result in missing true association signals. We elaborate on these results in the Discussion.
Discussion
We considered the issue of optimizing weighting in association studies based on the sequence kernel association test. Consistent with empirical [30] and simulation [53] results, we found that the likelihood ratio test is generally more robust to weight misspecification, and more powerful than the score test in such a circumstance. The principal finding of this study is that using a weighting scheme that includes alternative weights is likely to boost statistical power. Our results are of interest because weight assignment is embedded within any set-based test and the true weights of the variants within the target set are generally unknown.
In the literature, weighting is mostly informed by allele frequency; frequency is taken as indicative of the strength of the purifying selection coefficient [26]. Accordingly, rarer variants are typically assigned larger weights, and hence a larger contribution to the test statistic (e.g., [50]). This relationship between effect size, frequency and selection is not always straightforward, however, because it relies on assumptions about the extent of direct selection on the phenotype in question and the demographic history of the population [18,40,55]. Genes under weak selection may harbor rare as well as more common variants with disruptive effects [55]. Such variants with deleterious effects, escaping selection and occurring at relatively high frequencies in the population, are plausible also under strong purifying selection, as simulation studies have demonstrated [40]. Achieving maximal power when testing such regions requires adapting the weighting scheme to match the hypothesized selection types. To this end, we proposed the use of a data-driven weighting approach. Our simulation results showed that such an approach maintains power close to that of the true (i.e., correctly specified) model. When applied to real data, this approach allowed us to locate previously reported genes conferring risk to schizophrenia (e.g., Aberg et al. [5]; Ripke et al. [4]), lending support to the conclusion that such a variable weighting approach is likely to boost statistical power. Such adaptive approaches were also recommended by Zuk et al. (2014) [55] and by Price et al. (2010) [40] as being optimal for gene-based tests. Deriving weights based on allele frequency is but one of the possible ways of prioritizing the contribution to the test statistic of the variants within the target set [50]. Alternative weighting schemes that incorporate the probability of a variant being damaging (as estimated by annotation tools such as Polyphen-2 [7] or SIFT [38]) may also be considered.
We emphasize that our data-driven weighting approach renders thresholding unnecessary. Thresholding (based either on counts or on allele frequency) was initially used in burden tests (e.g., [28,33,40]; see also [19] for an overview of burden tests), but it has also been employed in sequence-based variance component tests (e.g., [32,51]) for the purpose of removing neutral variation (see, e.g., [26]). Yet, in our empirical analysis this practice was counterproductive: imposing the (arbitrarily chosen) MAC thresholds muted the signal in genes located in regions already confirmed as associated with schizophrenia (i.e., PRRC2A and VARS2 (e.g., [5]), and AKT3 (e.g., [4])). Considering common variants along with the rare ones in sequence-based kernel association tests appears to be justified for three main reasons. First, the use of variable weighting schemes is equivalent to applying variable frequency thresholds: depending on their frequency, the weights either effectively remove variants from the test or favor their contribution to the test statistic. Second, only the joint signal, coming from rare and more common variants, enabled us to detect significant enrichment; none of these common variants would be detected in an ordinary GWAS (see Supplementary Tables S3–S5). Third, and importantly, with the current samples, our tests are mostly powered to locate regions under relatively weak selection pressures, and such regions are expected to harbour both rare and common variants with functional effects. To locate genes under stronger selection pressures, larger samples (see [55]) and the inclusion of more extreme weights (i.e., weights that overlook common variants and favour rare ones) will probably be required.
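To illustrate the first point, the following Python sketch evaluates the Beta density at a few minor allele frequencies under the three dbeta schemes of Table 4. It is a minimal illustration under stated assumptions: the frequencies in `mafs` are hypothetical, the function `dbeta` is written here as a counterpart of R's `dbeta`, and how the density enters the kernel in a given implementation (for instance, squared or not) is left aside.

```python
import math


def dbeta(x, a, b):
    """Beta(a, b) density evaluated at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    log_kernel = (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
    return math.exp(log_norm + log_kernel)


schemes = [(1, 1), (0.5, 0.5), (1, 25)]
mafs = [0.001, 0.01, 0.05, 0.30]  # hypothetical minor allele frequencies

for a, b in schemes:
    weights = [dbeta(maf, a, b) for maf in mafs]
    # Express each weight relative to the weight of the rarest variant.
    relative = [w / weights[0] for w in weights]
    print(f"dbeta({a},{b}):", [f"{r:.4f}" for r in relative])
```

Under dbeta(1,1) all variants receive the same weight, whereas under dbeta(1,25) the weight of a variant with a frequency of 0.30 is a very small fraction of that of the rarest variant, which in practice acts much like a frequency threshold.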
The LRT and the score test had equal power under the data-driven weighting approach. Note, however, that this equivalence hinged upon the inclusion of weights that closely resemble the true ones among the alternatives. The powers of the two tests will likely diverge when the weights in the set are all badly specified; in such a circumstance, the LRT is expected to show superior power (due to its robustness to assumption violation). This is likely illustrated in the empirical analysis where, under the selected dbeta(1,1) weighting scheme, the LRT yielded lower p-values than the score test for all three top genes (Table 4). Yet, despite these differences in power, the score test is currently the dominant rare-variant association test, both in single studies and in meta-analyses (see, e.g., [3]). Integrating the LRT into meta-analytic techniques for rare-variant association testing is desirable, to ensure maximal power of detection, and will likely boost its application.
Both in the simulations and in the empirical analysis we chose to correct alpha by using the Bonferroni method. We chose this method for the sake of simplicity. Although one may argue that the method is slightly conservative as the tests are correlated, it is important to note that the Bonferroni-corrected weighting procedure confers more power than a badly specified weighting scheme would. P-value correction for a larger number of tests can be easily obtained using the p.adjust function implemented in the R stats package. Permutation may also be used to compute the p-value. However, the data-driven weighting approach based on permutations is prohibitively slow when the number of tested variants within the target set (or the number of genes) and the sample are large. The Bonferroni correction, though computationally easier, comes at a price in terms of power: the more weighting schemes one ‘tries’, the more stringent the significance threshold correction. An algorithm for optimal search for the ‘true’ weights (e.g., [37]) or limiting the choice of weights based on knowledge of theorized selection on each gene [55] would decrease the burden of multiple testing, and further increase power.
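For readers who wish to see the arithmetic, the Bonferroni adjustment can be sketched as below. This Python function mimics what R's `p.adjust(p, method = "bonferroni", n = ...)` computes, namely min(1, p × n); it is an illustration rather than the code used in the study. The value of `n_genes` is hypothetical, since the total number of tests corrected for is not restated in this section.

```python
def bonferroni(p_values, n=None):
    """Bonferroni-adjusted p-values, min(1, p * n).

    n defaults to the number of p-values supplied; a larger n can be given
    when the supplied p-values are only part of the family of tests.
    """
    if n is None:
        n = len(p_values)
    if n < len(p_values):
        raise ValueError("n must be at least the number of p-values")
    return [min(1.0, p * n) for p in p_values]


# Raw LRT p-values for PRRC2A under the three weighting schemes (Table 4).
raw = [1.020e-06, 5.8e-04, 0.055]

# Correcting for the three weighting schemes only (a toy family of tests).
adjusted_schemes = bonferroni(raw)

# Correcting for three schemes in each of n_genes genes (hypothetical count).
n_genes = 1000
adjusted_all = bonferroni(raw, n=3 * n_genes)

print(adjusted_schemes)
print(adjusted_all)
```

The sketch makes the trade-off explicit: each additional weighting scheme multiplies the family size, and hence the adjusted p-values, by a further factor.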
Conclusion
The score test is currently widely used in sequence-based association studies (e.g., [22,39,54]) for both its computational efficiency and power [50]. Indeed, assuming correct specification, in some circumstances the score test is the most powerful test [30,50]. However, the results provided herein showed that the likelihood ratio test has the compelling qualities of being generally more robust and more powerful under weight misspecification. This is an important result, given that, arguably, misspecified models are likely to be the rule rather than the exception in weighting-based approaches.
Acknowledgments
We thank the Swedish cohort participants whose data we analyzed in this study. Camelia C. Minica and Michael C. Neale are supported by the National Institute on Drug Abuse grant DA-018673. Jacqueline M. Vink is supported by the ERC starting grant 284167.
Footnotes
Supplementary Material accompanies the paper on the Twin Research and Human Genetics website.
References
- 1. Genovese Giulio, Fromer Menachem, Stahl Eli A, Ruderfer Douglas M, Chambert Kimberly, Landén Mikael, Moran Jennifer L, Purcell Shaun M, Sklar Pamela, Sullivan Patrick F, et al. Increased burden of ultra-rare protein-altering variants among 4,877 individuals with schizophrenia. Nature Neuroscience. 2016.
- 2. Shlyakhter Ilya, Sabeti Pardis C, Schaffner Stephen F. Cosi2: an efficient simulator of exact and approximate coalescent with selection. Bioinformatics. 2014:3427–3429.
- 3. Tang Zheng-Zheng, Lin Dan-Yu. Meta-analysis for Discovering Rare-Variant Associations: Statistical Methods and Software Programs. The American Journal of Human Genetics. 2015:35–53. doi: 10.1016/j.ajhg.2015.05.001.
- 4. Ripke Stephan, Neale Benjamin M, Corvin Aiden, Walters James TR, Farh Kai-How, Holmans Peter A, Lee Phil, Bulik-Sullivan Brendan, Collier David A, Huang Hailiang, et al. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2013:421–427.
- 5. Aberg Karolina A, Liu Youfang, Bukszár Jozsef, McClay Joseph L, Khachane Amit N, Andreassen Ole A, Blackwood Douglas, Corvin Aiden, Djurovic Srdjan, Gurling Hugh, et al. A comprehensive family-based replication study of schizophrenia genes. JAMA psychiatry. 2013:573–581.
- 6. Neale Michael C, Hunter Michael D, Pritikin Joshua N, Zahery Mahsa, Brick Timothy R, Kirkpatrick Robert M, Estabrook Ryne, Bates Timothy C, Maes Hermine H, Boker Steven M. OpenMx 2.0: Extended structural equation and statistical modeling. Psychometrika. 2015:1–15.
- 7. Adzhubei Ivan A, Schmidt Steffen, Peshkin Leonid, Ramensky Vasily E, Gerasimova Anna, Bork Peer, Kondrashov Alexey S, Sunyaev Shamil R. A method and server for predicting damaging missense mutations. Nature methods. 2010;7(4):248–249. doi: 10.1038/nmeth0410-248.
- 8. van der Auwera Geraldine A, Carneiro Mauricio O, Hartl Christopher, Poplin Ryan, del Angel Guillermo, Levy-Moonshine Ami, Jordan Tadeusz, Shakir Khalid, Roazen David, Thibault Joel. From fastq data to high confidence variant calls: The genome analysis toolkit best practices pipeline. Current protocols in bioinformatics. 2013. doi: 10.1002/0471250953.bi1110s43.
- 9. Basilevsky A. Applied matrix algebra in the statistical sciences. Elsevier Science Publishing; New York: 1983.
- 10. Chen Han, Meigs James B, Dupuis Josée. Sequence kernel association test for quantitative traits in family samples. Genetic epidemiology. 2013;37(2):196–204. doi: 10.1002/gepi.21703.
- 11. Cohen Jonathan C, Boerwinkle Eric, Mosley Thomas H, Jr, Hobbs Helen H. Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. New England Journal of Medicine. 2006;354(12):1264–1272. doi: 10.1056/NEJMoa054013.
- 12. Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014;511(7510):421–427. doi: 10.1038/nature13595.
- 13. Dalman CH, Broms J, Cullberg J, Allebeck P. Young cases of schizophrenia identified in a national inpatient register. Social psychiatry and psychiatric epidemiology. 2002;37(11):527–531. doi: 10.1007/s00127-002-0582-3.
- 14. Darnell Jennifer C, Van Driesche Sarah J, Zhang Chaolin, Hung Ka Ying Sharon, Mele Aldo, Fraser Claire E, Stone Elizabeth F, Chen Cynthia, Fak John J, Chi Sung Wook. FMRP stalls ribosomal translocation on mRNAs linked to synaptic function and autism. Cell. 2011;146(2):247–261. doi: 10.1016/j.cell.2011.06.013.
- 15. Davies R. The distribution of a linear combination of chi-square random variables. J R Stat Soc Ser C Appl Stat. 1980;29:323–333.
- 16. Daye Z John, Li Hongzhe, Wei Zhi. A powerful test for multiple rare variants association studies that incorporates sequencing qualities. Nucleic acids research. 2012;40(8):e60–e60. doi: 10.1093/nar/gks024.
- 17. DePristo Mark A, Banks Eric, Poplin Ryan, Garimella Kiran V, Maguire Jared R, Hartl Christopher, Philippakis Anthony A, del Angel Guillermo, Rivas Manuel A, Hanna Matt. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics. 2011;43(5):491–498. doi: 10.1038/ng.806.
- 18. Eyre-Walker Adam, Keightley Peter D. The distribution of fitness effects of new mutations. Nature Reviews Genetics. 2007;8(8):610–618. doi: 10.1038/nrg2146.
- 19. Franić Sanja, Dolan Conor V, Broxholme John, Hu Hao, Zemojtel Tomasz, Davies Garreth E, Nelson Kelly A, Ehli Erik A, Pool René, Hottenga Jouke-Jan. Mendelian and polygenic inheritance of intelligence: A common set of causal genes? Using next-generation sequencing to examine the effects of 168 intellectual disability genes on normal-range intelligence. Intelligence. 2015;49:10–22.
- 20. Fromer Menachem, Pocklington Andrew J, Kavanagh David H, Williams Hywel J, Dwyer Sarah, Gormley Padhraig, Georgieva Lyudmila, Rees Elliott, Palta Priit, Ruderfer Douglas M. De novo mutations in schizophrenia implicate synaptic networks. Nature. 2014;506(7487):179–184. doi: 10.1038/nature12929.
- 21. Greene William H. Econometric analysis. New Jersey: Prentice Hall; 2003.
- 22. Huyghe Jeroen R, Jackson Anne U, Fogarty Marie P, Buchkovich Martin L, Stančáková Alena, Stringham Heather M, Sim Xueling, Yang Lingyao, Fuchsberger Christian, Cederberg Henna. Exome array analysis identifies new loci and low-frequency variants influencing insulin processing and secretion. Nature genetics. 2013;45(2):197–201. doi: 10.1038/ng.2507.
- 23. Ionita-Laza Iuliana, Lee Seunggeun, Makarov Vlad, Buxbaum Joseph D, Lin Xihong. Sequence kernel association tests for the combined effect of rare and common variants. The American Journal of Human Genetics. 2013;92(6):841–853. doi: 10.1016/j.ajhg.2013.04.015.
- 24. Iossifov Ivan, Ronemus Michael, Levy Dan, Wang Zihua, Hakker Inessa, Rosenbaum Julie, Yamrom Boris, Lee Yoon-ha, Narzisi Giuseppe, Leotta Anthony. De novo gene disruptions in children on the autistic spectrum. Neuron. 2012;74(2):285–299. doi: 10.1016/j.neuron.2012.04.009.
- 25. Kristjansson Einar, Allebeck Peter, Wistedt Börje. Validity of the diagnosis schizophrenia in a psychiatric inpatient register: a retrospective application of DSM-III criteria on ICD-8 diagnoses in Stockholm county. Nordic Journal of Psychiatry. 1987;41(3):229–234.
- 26. Kryukov Gregory V, Pennacchio Len A, Sunyaev Shamil R. Most rare missense alleles are deleterious in humans: implications for complex disease and association studies. The American Journal of Human Genetics. 2007;80(4):727–739. doi: 10.1086/513473.
- 27. Lee Seunggeun, Wu Michael C, Lin Xihong. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13(4):762–775. doi: 10.1093/biostatistics/kxs014.
- 28. Li Bingshan, Leal Suzanne M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. The American Journal of Human Genetics. 2008;83(3):311–321. doi: 10.1016/j.ajhg.2008.06.024.
- 29. Li Heng, Durbin Richard. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–1760. doi: 10.1093/bioinformatics/btp324.
- 30. Lippert Christoph, Xiang Jing, Horta Danilo, Widmer Christian, Kadie Carl, Heckerman David, Listgarten Jennifer. Greater power and computational efficiency for kernel-based association testing of sets of genetic variants. Bioinformatics. 2014;30(22):3206–3214. doi: 10.1093/bioinformatics/btu504.
- 31. Listgarten Jennifer, Lippert Christoph, Kang Eun Yong, Xiang Jing, Kadie Carl M, Heckerman David. A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics. 2013;29(12):1526–1533. doi: 10.1093/bioinformatics/btt177.
- 32. Lohmueller Kirk E, Sparsø Thomas, Li Qibin, Andersson Ehm, Korneliussen Thorfinn, Albrechtsen Anders, Banasik Karina, Grarup Niels, Hallgrimsdottir Ingileif, Kiil Kristoffer. Whole-exome sequencing of 2,000 Danish individuals and the role of rare coding variants in type 2 diabetes. The American Journal of Human Genetics. 2013;93(6):1072–1086. doi: 10.1016/j.ajhg.2013.11.005.
- 33. Madsen Bo Eskerod, Browning Sharon R. A groupwise association test for rare mutations using a weighted sum statistic. PLoS genetics. 2009;5(2):e1000384. doi: 10.1371/journal.pgen.1000384.
- 34. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TFC, McCarroll SA, Visscher PM. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–753. doi: 10.1038/nature08494.
- 35. Mather K, Jinks JL. Introduction to biometrical genetics. Ithaca, NY: Cornell University Press; 1977.
- 36. McKenna Aaron, Hanna Matthew, Banks Eric, Sivachenko Andrey, Cibulskis Kristian, Kernytsky Andrew, Garimella Kiran, Altshuler David, Gabriel Stacey, Daly Mark. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research. 2010;20(9):1297–1303. doi: 10.1101/gr.107524.110.
- 37. Neale Michael, Cardon Lon. Methodology for genetic studies of twins and families. Springer Science & Business Media; 1992.
- 38. Ng Pauline C, Henikoff Steven. SIFT: Predicting amino acid changes that affect protein function. Nucleic acids research. 2003;31(13):3812–3814. doi: 10.1093/nar/gkg509.
- 39. Peloso Gina M, Auer Paul L, Bis Joshua C, Voorman Arend, Morrison Alanna C, Stitziel Nathan O, Brody Jennifer A, Khetarpal Sumeet A, Crosby Jacy R, Fornage Myriam. Association of low-frequency and rare coding-sequence variants with blood lipids and coronary heart disease in 56,000 whites and blacks. The American Journal of Human Genetics. 2014;94(2):223–232. doi: 10.1016/j.ajhg.2014.01.009.
- 40. Price Alkes L, Kryukov Gregory V, de Bakker Paul IW, Purcell Shaun M, Staples Jeff, Wei Lee-Jen, Sunyaev Shamil R. Pooled association tests for rare variants in exon-resequencing studies. The American Journal of Human Genetics. 2010;86(6):832–838. doi: 10.1016/j.ajhg.2010.04.005.
- 41. Pritchard Jonathan K. Are rare variants responsible for susceptibility to complex diseases? The American Journal of Human Genetics. 2001;69(1):124–137. doi: 10.1086/321272.
- 42. Purcell Shaun M, Moran Jennifer L, Fromer Menachem, Ruderfer Douglas, Solovieff Nadia, Roussos Panos, O'Dushlaine Colm, Chambert Kimberly, Bergen Sarah E, Kähler Anna. A polygenic burden of rare disruptive mutations in schizophrenia. Nature. 2014;506(7487):185–190. doi: 10.1038/nature12975.
- 43. Ripke Stephan, O'Dushlaine Colm, Chambert Kimberly, Moran Jennifer L, Kähler Anna K, Akterin Susanne, Bergen Sarah E, Collins Ann L, Crowley James J, Fromer Menachem. Genome-wide association analysis identifies 13 new risk loci for schizophrenia. Nature genetics. 2013;45(10):1150–1159. doi: 10.1038/ng.2742.
- 44. Samocha Kaitlin E, Robinson Elise B, Sanders Stephan J, Stevens Christine, Sabo Aniko, McGrath Lauren M, Kosmicki Jack A, Rehnström Karola, Mallick Swapan, Kirby Andrew. A framework for the interpretation of de novo mutation in human disease. Nature genetics. 2014;46(9):944–950. doi: 10.1038/ng.3050.
- 45. Schork Nicholas J, Murray Sarah S, Frazer Kelly A, Topol Eric J. Common vs. rare allele hypotheses for complex diseases. Current opinion in genetics & development. 2009;19(3):212–219. doi: 10.1016/j.gde.2009.04.010.
- 46. Speed Doug, Hemani Gibran, Johnson Michael R, Balding David J. Improved heritability estimation from genome-wide SNPs. The American Journal of Human Genetics. 2012;91(6):1011–1021. doi: 10.1016/j.ajhg.2012.10.010.
- 47. Svishcheva Gulnara R, Belonogova Nadezhda M, Axenovich Tatiana I. FFBSKAT: fast family-based sequence kernel association test. PLoS ONE. 2014;9(6):e99407. doi: 10.1371/journal.pone.0099407.
- 48. Teslovich Tanya M, Musunuru Kiran, Smith Albert V, Edmondson Andrew C, Stylianou Ioannis M, Koseki Masahiro, Pirruccello James P, Ripatti Samuli, Chasman Daniel I, Willer Cristen J. Biological, clinical and population relevance of 95 loci for blood lipids. Nature. 2010;466(7307):707–713. doi: 10.1038/nature09270.
- 49. Venables William N, Ripley Brian D. Modern applied statistics with S. Springer Science & Business Media; 2002.
- 50. Wu Michael C, Lee Seunggeun, Cai Tianxi, Li Yun, Boehnke Michael, Lin Xihong. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics. 2011;89(1):82–93. doi: 10.1016/j.ajhg.2011.05.029.
- 51. Xu ChangJiang, Tachmazidou Ioanna, Walter Klaudia, Ciampi Antonio, Zeggini Eleftheria, Greenwood Celia MT. Estimating genome-wide significance for whole-genome sequencing studies. Genetic epidemiology. 2014;38(4):281–290. doi: 10.1002/gepi.21797.
- 52. Yang Jian, Lee S Hong, Goddard Michael E, Visscher Peter M. GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics. 2011;88(1):76–82. doi: 10.1016/j.ajhg.2010.11.011.
- 53. Zeng Ping, Zhao Yang, Liu Jin, Liu Liya, Zhang Liwei, Wang Ting, Huang Shuiping, Chen Feng. Likelihood ratio tests in rare variant detection for continuous phenotypes. Annals of Human Genetics. 2014;78(5):320–332. doi: 10.1111/ahg.12071.
- 54. Zhan Xiaowei, Larson David E, Wang Chaolong, Koboldt Daniel C, Sergeev Yuri V, Fulton Robert S, Fulton Lucinda L, Fronick Catrina C, Branham Kari E, Bragg-Gresham Jennifer. Identification of a rare coding variant in complement 3 associated with age-related macular degeneration. Nature genetics. 2013;45(11):1375–1379. doi: 10.1038/ng.2758.
- 55. Zuk Or, Schaffner Stephen F, Samocha Kaitlin, Do Ron, Hechter Eliana, Kathiresan Sekar, Daly Mark J, Neale Benjamin M, Sunyaev Shamil R, Lander Eric S. Searching for missing heritability: designing rare variant association studies. Proceedings of the National Academy of Sciences. 2014;111(4):E455–E464. doi: 10.1073/pnas.1322563111.