Author manuscript; available in PMC: 2021 Jun 1.
Published in final edited form as: Genet Epidemiol. 2020 Feb 26;44(4):352–367. doi: 10.1002/gepi.22287

Convex Combination Sequence Kernel Association Test for Rare Variant Studies

Daniel Posner 1, Honghuang Lin 2,3, James B Meigs 4, Eric D Kolaczyk 5, Josée Dupuis 1,2
PMCID: PMC7205561  NIHMSID: NIHMS1574897  PMID: 32100372

Abstract

We propose a novel variant set test for rare-variant association studies that leverages multiple SNV annotations. Our approach optimizes a convex combination of different Sequence Kernel Association Test (SKAT) statistics, where each statistic is constructed from a different annotation and combination weights are optimized through a multiple kernel learning algorithm. The combination test statistic is evaluated empirically through data-splitting. In simulations, we find our method preserves type I error at α = 2.5 × 10⁻⁶ and has greater power than SKAT(-O) when SNV weights are not misspecified and sample sizes are large (N ≥ 5000). We apply our method in the Framingham Heart Study (FHS) to identify SNV sets associated with fasting glucose. While we are unable to detect any genome-wide significant associations between fasting glucose and 4kb windows of rare variants (p < 10⁻⁷) in 6,419 FHS participants, our method identifies suggestive associations between fasting glucose and rare variants near ROCK2 (p = 2.1 × 10⁻⁵) and within CPLX1 (p = 5.3 × 10⁻⁵). These two genes were previously reported to be involved in obesity-mediated insulin resistance and glucose-induced insulin secretion by pancreatic beta-cells, respectively. These findings will need to be replicated in other cohorts and validated by functional genomic studies.

Keywords: fasting glucose, rare variant association study, SKAT, convex optimization

1. Introduction

Many complex traits are heritable, but the exact genetic causes are difficult to determine. A common method for disentangling the effects of different genetic factors is to perform a “genome-wide association study” (GWAS), where each single nucleotide variant (SNV) is tested for association with a trait of interest. However, the minor allele frequency (MAF) of a variant can strongly influence the power of a single variant test, resulting in low power to detect the effects of rare (MAF ≤ 0.5%) and low-frequency (0.5% < MAF ≤ 5%) SNVs. For this reason, rare variants are aggregated into SNV-sets to increase the cumulative (or combined) MAF of all variants being tested, thereby improving power to detect a joint association between the trait and SNVs in the set.

A wide range of SNV-set tests have been proposed for rare variant association studies. Broadly speaking, they can be categorized as methods that combine SNVs (Li and Leal, 2008; Morgenthaler and Thilly, 2007; Madsen and Browning, 2009) or methods that combine marginal test statistics (or p-values) for each SNV (Conneely and Boehnke, 2007; Wu et al., 2011; Zhan et al., 2017b; Barnett et al., 2017; Sun et al., 2019; Liu et al., 2019). Methods that combine SNVs, or burden tests, evaluate the association between a trait and a weighted sum of SNVs (or burden score). Burden tests have poor power when SNVs have different directions of effect, leading to the adoption of methods that combine marginal test statistics and are powerful for testing a mix of protective and deleterious SNVs.

Statistical power of variant set tests can be improved by weighting SNVs by their hypothesized effect on the trait or disease risk as a function of available annotation, e.g. a function of MAF giving more weight to rarer SNVs. Fixed SNV weights, however, may misspecify the contribution of the variants and lower the power of the variant set test (Minica et al., 2017). To reduce weight misspecification, adaptive tests have been proposed that compute many different statistics and select the test with the smallest p-value, such as the Multi-kernel Sequence Kernel Association Test (MK-SKAT) (Wu et al., 2013; Urrutia et al., 2016) or the omnibus test statistic (OMNI) (Barnett et al., 2017). Other adaptive approaches optimize a combination of test statistics, such as the optimal unified SKAT (SKAT-O) (Lee et al., 2012), which finds the best convex combination of two SKAT statistics with different SNV weights. The first approach ignores complementary information from annotations that are not selected, while the second approach is restricted to two weighting schemes.

In this paper, we present a test which is a convex combination of any number of SKAT statistics. Our method optimizes composite SNV weights from multiple annotations, such that SNVs unrelated to the trait (through the annotations) are assigned low weight and are effectively excluded from the SNV-set test. Weighted-kernel averaging of SKAT statistics is not novel; it was originally proposed by Wu et al. (2013). The Wu et al. approach sets kernel weights a priori, such as assigning equal weight to each candidate kernel. Our proposed method, on the other hand, adaptively estimates kernel weights from the data. We compare both approaches (fixed equal weights and adaptive weights) in simulations.

Another concern in rare variant analysis is the choice of SNVs to include in the SNV-set, which is critical to the power of the association test. If the SNV-set is chosen poorly, the association signals from causal SNVs within the set are diluted by SNVs in the set that are unrelated to the trait. Genes or gene sets (e.g. pathways) are natural SNV-sets, but no such organizing principle exists for SNVs located outside of genes, called “intergenic” SNVs. We adopt the approach of a previous analysis, where intergenic SNVs were aggregated within 4000 base-pair (4kb) windows with 50% overlap (Morrison et al., 2013, 2017), and further screen SNVs using annotations with potential biological relevance to fasting metabolism (Goldstein and Hager, 2015).

2. Methods

Our method is built on SKAT (Wu et al., 2011), a variance component test for the association between a set of SNVs and a trait. We briefly describe the SKAT approach for testing SNV-sets. We then introduce our proposed method, cSKAT, to find the optimal convex combination of candidate SKAT statistics. Because of the optimization involved, the cSKAT statistic null distribution must be assessed empirically. We offer a method for doing so, based on data splitting. We also present a biologically informed approach to construct candidate kernels.

2.1. Convex-optimized SKAT (cSKAT)

Let $y$ be an $(n \times 1)$ vector of trait values for $n$ subjects; $X$ an $(n \times d)$ design matrix of non-genetic covariates and $\beta$ a $(d \times 1)$ vector of non-genetic effects, both including an intercept; $G$ an $(n \times m)$ matrix of SNV genotypes or dosages; $G_{i\cdot}$ the $(m \times 1)$ genotype vector for the $i$th subject; and $G_{ij}$ the $i$th subject’s genotype for the $j$th variant ($0 \le G_{ij} \le 2$). Assume a generalized linear mixed effects model relating trait to SNV genotypes:

$$g(E(y)) = X\beta + h \tag{1}$$

where the link function $g$ is the identity link for continuous traits or the logit link for binary traits, $h = (h(G_{1\cdot}), \ldots, h(G_{n\cdot}))^T$ is an $(n \times 1)$ vector of genetic effects on the subjects’ traits, and the function $h(\cdot)$ lies in a function space generated by a positive-semidefinite kernel function $k(\cdot, \cdot)$ that satisfies Mercer’s condition (Cristianini and Shawe-Taylor, 2000). The kernel function, $k(G_{i\cdot}, G_{j\cdot})$, measures similarity between the $i$th and $j$th subjects based on their SNV genotypes in the SNV-set.

When proposing SKAT, Wu et al. assumed that $h$ is distributed $N(0, \tau K)$, where $\tau$ is a variance component indexing the effect of the SNV-set and $K$ is a known kernel matrix with entries defined by a kernel function, $K_{ij} = k(G_{i\cdot}, G_{j\cdot})$. SKAT (Wu et al., 2011) is a test of the null hypothesis for the SNV effects, $H_0 : \tau = 0$, using the following statistic:

$$Q = \frac{(y - \hat{y}_0)^T K (y - \hat{y}_0)}{\hat{\phi}_0} \tag{2}$$

where $\hat{y}_0 = [g^{-1}(\hat{\beta}_0^T X_1), \ldots, g^{-1}(\hat{\beta}_0^T X_n)]$ is the trait predicted from non-genetic covariates, and $\hat{\phi}_0$ and $\hat{\beta}_0$ are maximum likelihood estimates of the dispersion parameter and non-genetic effects under $H_0$, respectively. When $y$ is continuous, $\hat{\phi}_0 = \hat{\sigma}_0^2$ is the residual variance of $y$ after accounting for non-genetic covariates; when $y$ is binary, $\hat{\phi}_0 = 1$.

Here we embed functional genomic elements, or annotations, directly in SKAT and weight the annotations based on their potential relevance to the trait. Given $L$ candidate annotations, let $Q_l$ and $K_l$ be the $l$th candidate SKAT statistic and kernel matrix, and let $\gamma_l$ be the corresponding convex weight, with $\gamma$ in the set $\{\gamma : \sum_{l=1}^{L} \gamma_l = 1, \gamma \ge 0\}$. The convex SKAT (cSKAT) statistic is defined as a convex combination of candidate SKAT statistics:

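$$Q_\gamma = \sum_{l=1}^{L} \gamma_l Q_l = \frac{(y - \hat{y}_0)^T \left(\sum_{l=1}^{L} \gamma_l K_l\right) (y - \hat{y}_0)}{\hat{\phi}_0}. \tag{3}$$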

Hence, the cSKAT statistic is defined through a convex combination of kernels, $\{K : K = \sum_{l=1}^{L} \gamma_l K_l, \sum_{l=1}^{L} \gamma_l = 1, \gamma \ge 0\}$. We describe how to construct these kernels from functional genomic annotation in Section 2.3, how to estimate the convex weights in Appendix A.1, and how to evaluate the test statistic null distribution below.
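To make Equation (3) concrete, the following minimal sketch computes two candidate SKAT statistics and their convex combination on toy data. The residuals, kernels, weights, and the helper names `quad_form` and `cskat_statistic` are illustrative assumptions rather than the authors' implementation, and plain Python lists are used only to keep the example dependency-free.

```python
def quad_form(r, K):
    """Return r^T K r for a residual vector r and a square matrix K."""
    n = len(r)
    return sum(r[i] * K[i][j] * r[j] for i in range(n) for j in range(n))


def cskat_statistic(resid, kernels, gamma, phi0=1.0):
    """Q_gamma = r^T (sum_l gamma_l K_l) r / phi0 for convex weights gamma."""
    if any(g < 0 for g in gamma) or abs(sum(gamma) - 1.0) > 1e-9:
        raise ValueError("gamma must be non-negative and sum to 1")
    n = len(resid)
    K_mix = [[sum(g * K[i][j] for g, K in zip(gamma, kernels))
              for j in range(n)] for i in range(n)]
    return quad_form(resid, K_mix) / phi0


# Toy inputs (hypothetical): residuals y - y0_hat for 3 subjects, 2 kernels.
resid = [0.5, -1.0, 0.5]
K1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K2 = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
gamma = [0.25, 0.75]

Q1 = quad_form(resid, K1)   # candidate statistic for kernel 1 (phi0 = 1)
Q2 = quad_form(resid, K2)   # candidate statistic for kernel 2 (phi0 = 1)
Q_gamma = cskat_statistic(resid, [K1, K2], gamma)
```

Because the quadratic form is linear in the kernel, combining the kernels first or combining the candidate statistics afterwards gives the same value.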

When the combination weights, $\gamma$, are fixed or optimized on an independent set of data, the null distribution of $Q$ is a weighted sum of independent $\chi^2_1$ variables, $\sum_{j=1}^{J} \lambda_j \chi^2_1$, where $\lambda_j$ are the eigenvalues of $(1/\hat{\phi}_0) P^{1/2} \left(\sum_{l=1}^{L} \gamma_l K_l\right) P^{1/2}$ and $P = V - VX(X^T V X)^{-1} X^T V$ is the variance of the residuals $(y - \hat{y}_0)$. For continuous traits, $V = \hat{\sigma}_0^2 I_n$, where $I_n$ is an $(n \times n)$ identity matrix; for binary traits, $V = \mathrm{diag}[\hat{y}_{01}(1 - \hat{y}_{01}), \ldots, \hat{y}_{0n}(1 - \hat{y}_{0n})]$, where $\hat{y}_{0i} = \mathrm{logit}^{-1}(\hat{\beta}_0^T X_i)$ is the estimated probability that subject $i$ is a case under $H_0$. Asymptotic p-values can be computed analytically with the Davies method (Davies, 1980) or approximated with high accuracy with the saddlepoint method (Kuonen, 1999).
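As a rough illustration of the weighted chi-square null distribution, the sketch below estimates a tail probability by Monte Carlo. This is a stand-in chosen for simplicity, not the Davies or saddlepoint method named above, and it cannot resolve p-values anywhere near genome-wide thresholds. The eigenvalues and the function name `mixture_chisq_pvalue` are hypothetical.

```python
import random


def mixture_chisq_pvalue(q_obs, eigenvalues, n_draws=20000, seed=1):
    """Monte Carlo estimate of P(sum_j lambda_j * chi2_1 >= q_obs)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        # Each chi-square(1) draw is the square of a standard normal draw.
        draw = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in eigenvalues)
        if draw >= q_obs:
            exceed += 1
    return exceed / n_draws


lams = [0.6, 0.3, 0.1]   # hypothetical eigenvalues, not taken from the paper
p_small_q = mixture_chisq_pvalue(0.5, lams)
p_large_q = mixture_chisq_pvalue(5.0, lams)
```

With a single eigenvalue equal to 1 the mixture reduces to an ordinary chi-square with one degree of freedom, which gives a quick sanity check of the simulation.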

When the combination weights, γ, are optimized from the same data used for the test, Q can be evaluated through permutation testing. In a permutation test, the test statistic null distribution would be approximated by fully resampling the observed traits without replacement (i.e. permutation) and recomputing the test statistic for each permutation of trait values. Permutations are computationally burdensome and are difficult to implement for dependent individuals, such as relatives in the Framingham Heart Study. Due to these limitations, we instead use (single) sample-splitting in our simulations and analysis, where weights are estimated in a subset of individuals and the tests are performed in the remaining individuals. Multiple sample splits may be used to improve power and reproducibility (Meinshausen et al., 2009).
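The single sample split described above can be sketched as follows. The function name `split_sample` and the seed are assumptions for illustration, and the sketch ignores relatedness, which would need separate handling in a family-based cohort.

```python
import random


def split_sample(n_subjects, n_estimation, seed=0):
    """Randomly split subject indices into estimation and test subsets."""
    rng = random.Random(seed)
    idx = list(range(n_subjects))
    rng.shuffle(idx)
    estimation = sorted(idx[:n_estimation])   # used only to estimate gamma
    test = sorted(idx[n_estimation:])         # used only for the association test
    return estimation, test


est_idx, test_idx = split_sample(1000, 500)
```

Keeping the two index sets disjoint is what allows the weights estimated on the first subset to be treated as fixed when testing in the second.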

2.2. SNV Annotations

In our analyses, we use four classes of annotation: SNV MAF and three ENCODE annotations (ENCODE Project Consortium et al., 2012), which include signals of functional genomic elements along the genome (see Table 1).

Table 1:

SNV Annotations

| Class (l) | # Features | Type | [Min, Max] | Source |
| --- | --- | --- | --- | --- |
| Open Chromatin | 1 | continuous | [0, 1000] | ENCODE |
| Transcription Factors | 11 | continuous | [0, 1000] | ENCODE |
| Histone Modifications | 2 | continuous | [0, 1000] | ENCODE |
| SKAT MAF weight | 1 | continuous | [0, 25] | $f_{\mathrm{Beta}(1,25)}(\mathrm{MAF})$ |

Each ENCODE signal (scaled 0–1000) is derived from chromatin immunoprecipitation sequencing (ChIP-seq) of a specific DNA-binding element in a specific cell type. For example, the transcription factor Forkhead box protein A2 (FOXA2) has a non-zero number of reads mapping to genomic regions in red blood cells, cancer cells, and other cell types. Read counts at each genomic locus are normalized, compared against the null distribution, and transformed into false discovery rates (q-values). The signals provided by ENCODE are q-values rescaled to 0–1000 to facilitate visualization.

For each functional genomic element, such as FOXA2, we take the maximum signal over all cell types relevant to a trait. For fasting glucose, we use the maximum FOXA2 signal at each genomic location in all available red blood cells, β-cells (if available), and white blood cells. We call this FOXA2 signal vector an annotation “feature”. We call the collection of all transcription factors (TFs) an annotation “class”. Only transcription factors and histone modifications related to fasting metabolism (Goldstein and Hager, 2015) are included in our rare-variant association study of fasting glucose. We construct one kernel for each class from features in Table 2.

Table 2:

SNV Annotation Functions

| Class | Feature | Function (abbreviated) |
| --- | --- | --- |
| Open Chromatin | DNase-seq Peaks | Indicator of regions accessible for transcription |
| TF | CEBP-β | Gluconeogenesis |
| TF | EGR1 | Induces CEBP-α when activated by glucagon |
| TF | ERRα | Gluconeogenesis, fatty acid metabolism |
| TF | FOXA2 | Gluconeogenesis, fatty-acid oxidation (FAO), ketogenesis |
| TF | GR | Induces genes encoding fasting-related transcription factors |
| TF | HNF4α | Maturity-onset Type 1 diabetes, gluconeogenesis |
| TF | NRF1 | Links transcription of metabolic genes to cellular growth |
| TF | P300 | Interacts with PPARγ (regulator of glucose metabolism) |
| TF | PGC-1α | Regulates energy metabolism genes |
| TF | SREBP-1,2 | Lipid homeostasis |
| TF | TR | Responsible for many metabolic functions of thyroid hormone |
| HM | H3K9Ac | Highly correlated with active promoters |
| HM | H3K36me3 | Represses aberrant transcription, involved in defining exons |
| Minor Allele Frequency | Minor Allele Frequency | Rare SNVs are more likely to be causal (due to natural selection) |
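The construction of features and classes described in Section 2.2 can be sketched as below: a feature is the per-SNV maximum of one element's signal across cell types, and a class weight is the sum of its features rescaled to the unit interval. The signal values and cell-type names are hypothetical, and min-max scaling is an assumption, since the text does not say which normalization to the unit interval was used.

```python
def max_over_cell_types(signal_by_cell_type):
    """Per-SNV maximum of one element's signal across the chosen cell types."""
    return [max(values) for values in zip(*signal_by_cell_type.values())]


def class_weights(features):
    """Sum the features of one class at each SNV, then min-max scale to [0, 1]."""
    totals = [sum(values) for values in zip(*features)]
    lo, hi = min(totals), max(totals)
    if hi == lo:                       # constant class: nothing to rescale
        return [0.0] * len(totals)
    return [(t - lo) / (hi - lo) for t in totals]


# Hypothetical ENCODE-style signals (0-1000) at three SNVs.
foxa2 = max_over_cell_types({"cell_type_a": [0, 500, 1000],
                             "cell_type_b": [200, 100, 0]})
p300 = [0, 500, 0]
tf_weights = class_weights([foxa2, p300])
```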

2.3. Specification of Kernel Matrices

Any positive semidefinite kernel can be specified for K, though in most rare variant studies, the weighted linear kernel is used. As its name suggests, the weighted linear kernel rescales each subject’s genotype vector by fixed weights, and its entries are dot products of these weighted genotypes. Let wkl be the sum of features in annotation class l at SNV k (normalized to the unit interval) and Gik be the ith subject’s dosage of the kth SNV (0 ≤ Gik ≤ 2). The lth weighted linear kernel function for subjects i and j is:

$$(K_l)_{ij} = \sum_{k=1}^{m} w_{kl}^2 G_{ik} G_{jk}. \tag{4}$$

Optimal SNV weights for traits are unknown, so investigators use estimates based on allele frequencies (i.e. rare alleles are more likely causal due to natural selection) or predicted functional consequence scores derived from functional genomic elements, such as transcription factors. In rare variant studies, the most commonly used weight is the Beta(1,25) density evaluated at the SNV MAF, wk = fBeta(1,25)(MAFk) (Wu et al., 2011). A recent study has also used functional impact scores from bioinformatics tools (Morrison et al., 2017).
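The weighted linear kernel of Equation (4) and the Beta(1,25) MAF weight can be sketched in a few lines. The closed form 25(1 − MAF)²⁴ follows from the Beta(1, b) density b(1 − x)^(b−1); the genotypes, weights, and function names are illustrative assumptions.

```python
def beta_1_25_weight(maf):
    """Beta(1, 25) density at maf, i.e. 25 * (1 - maf)**24 for 0 <= maf <= 1."""
    return 25.0 * (1.0 - maf) ** 24


def weighted_linear_kernel(G, w):
    """(K)_ij = sum_k w_k^2 * G_ik * G_jk for genotype rows G and SNV weights w."""
    n = len(G)
    return [[sum(wk * wk * gi * gj for wk, gi, gj in zip(w, G[i], G[j]))
             for j in range(n)] for i in range(n)]


# Toy dosages for 3 subjects at 2 SNVs, with hypothetical SNV weights.
G = [[0, 1], [1, 2], [0, 0]]
w = [1.0, 0.5]
K = weighted_linear_kernel(G, w)

# MAF-based weights for a rare and a more common SNV.
maf_weights = [beta_1_25_weight(0.001), beta_1_25_weight(0.05)]
```

The density is largest at MAF = 0, where it equals 25, which is why rarer SNVs receive more weight under this scheme.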

In our extension of SKAT, we find better SNV weights for a trait by optimizing the kernel. We consider a class of composite kernels, $\{K : K = \sum_{l=1}^{L} \gamma_l K_l, \sum_{l=1}^{L} \gamma_l = 1, \gamma \ge 0\}$, from which to select an optimal kernel for the trait. The convex combination weights are optimized through centered kernel-target alignment (Cortes et al., 2012) to emphasize only annotation classes that are potentially relevant to the trait (see Appendix A.1). Before optimization, all base kernels are trace-normalized and centered by pre- and post-multiplying by an $(n \times n)$ centering matrix, $C_n = I_n - \frac{1_n 1_n^T}{n}$, where $1_n$ is an $(n \times 1)$ vector of 1’s:

$$K = \frac{C_n \tilde{K} C_n}{\mathrm{tr}(C_n \tilde{K} C_n)} \tag{5}$$

where $\tilde{K}$ is a raw kernel matrix and $K$ is the trace-normalized and centered kernel. When all candidate kernels are weighted linear kernels, optimizing the kernel combination is equivalent to optimizing SNV weights ($w = [w_1, w_2, \ldots, w_m]$) from convex combinations of annotations, $\{w : w_k = \sum_{l=1}^{L} \gamma_l w_{kl}, \sum_{l=1}^{L} \gamma_l = 1, \gamma \ge 0\}$. To ensure $\gamma$ is interpretable, annotations of each class are normalized to the unit interval.
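The centering and trace normalization in Equation (5) can be sketched without forming the centering matrix explicitly, because pre- and post-multiplying by it subtracts row and column means and adds back the grand mean. The function name and the toy kernel are assumptions for illustration.

```python
def center_and_trace_normalize(K_raw):
    """Return C K C / tr(C K C), where C = I - (1/n) * 1 1^T."""
    n = len(K_raw)
    row_means = [sum(row) / n for row in K_raw]
    col_means = [sum(K_raw[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    # Double centering: entry minus its row mean and column mean, plus grand mean.
    centered = [[K_raw[i][j] - row_means[i] - col_means[j] + grand_mean
                 for j in range(n)] for i in range(n)]
    trace = sum(centered[i][i] for i in range(n))
    if trace == 0:
        raise ValueError("centered kernel has zero trace")
    return [[value / trace for value in row] for row in centered]


K_raw = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 3.0]]   # toy raw kernel
K_norm = center_and_trace_normalize(K_raw)
```

After this step every base kernel has unit trace and rows that sum to zero, so the kernels enter the weight optimization on a common scale.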

2.4. Type I Error and Power

We perform simulations to evaluate Type I error and compare power of our proposed test (cSKAT) and four versions of SKAT: unweighted linear combination SKAT (i.e. a sum of SKAT statistics computed separately with one annotation) (Wu et al., 2013), SKAT with ideal weights equal to SNP effect sizes, and SKAT (Wu et al., 2011) and SKAT-O (Lee et al., 2012) with weights as a function of MAF only. We also evaluate the power of a Cauchy combination test or ACAT (Liu et al., 2019) that is a combination of p-values from SKAT tests for each annotation, separately. MK-SKAT (Wu et al., 2013; Urrutia et al., 2016) software has yet to be released and, to our knowledge, is not computationally feasible for these simulations. For cSKAT, we create candidate kernels from MAF and three annotation classes from ENCODE: open chromatin (OC), transcription factors (TF), and histone modification (HM). Annotations used in each test are presented in Table 3.

Table 3:

Tests Compared

| Test | Annotation used |
| --- | --- |
| ACAT | OC, TF, HM, MAF |
| cSKAT (proposed) | OC, TF, HM, MAF |
| cSKAT, restricted to a subset of annotations | OC, MAF |
| SKAT, unweighted linear combination | OC, TF, HM, MAF |
| SKAT (ideal weights) | data-generating annotation |
| SKAT | MAF |
| SKAT-O | MAF |
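For readers unfamiliar with the Cauchy combination test (ACAT) listed in Table 3, a minimal sketch of its p-value combination step follows. It uses the standard Cauchy transform of each p-value and equal weights by default; the input p-values are hypothetical, and this is not the software used in the simulations.

```python
import math


def cauchy_combination(pvalues, weights=None):
    """Combine p-values through a weighted sum of Cauchy-transformed values."""
    if weights is None:
        weights = [1.0 / len(pvalues)] * len(pvalues)
    # Each p-value is mapped to a standard Cauchy variate under the null.
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues))
    # Upper tail of a Cauchy distribution with scale sum(weights).
    return 0.5 - math.atan(t / sum(weights)) / math.pi


# Hypothetical per-annotation SKAT p-values (OC, TF, HM, MAF).
p_combined = cauchy_combination([0.01, 0.5, 0.5, 0.5])
```

The combined p-value is dominated by the smallest inputs, which is one reason this approach performs well when a single annotation carries most of the signal.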

We simulate whole genomes for subjects with the software HAPGEN2 using reference genomes of European ancestry from the 1000 Genomes Project. We adopt the SNV test aggregation of intergenic regions from a previous analysis, where intergenic SNVs were grouped within 4000 base-pair (4kb) windows (Morrison et al., 2013, 2017). The tests are performed for each window with observed cumulative minor allele count (MAC) greater than 20 and evaluated at multiple type I error levels (α).
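The windowing and minor allele count filter can be sketched as follows, assuming 4kb windows that advance by 2kb (the 50% overlap mentioned in the Introduction). The half-open window boundaries, the input format, and the function names are assumptions for illustration.

```python
def sliding_windows(start, end, width=4000, step=2000):
    """Yield half-open (lo, hi) windows of `width` bp that advance by `step` bp."""
    lo = start
    while lo < end:
        yield lo, min(lo + width, end)
        lo += step


def windows_to_test(snvs, start, end, min_mac=20):
    """Keep windows whose cumulative minor allele count exceeds min_mac.

    snvs is a list of (position, minor_allele_count) pairs.
    """
    kept = []
    for lo, hi in sliding_windows(start, end):
        members = [(pos, mac) for pos, mac in snvs if lo <= pos < hi]
        if sum(mac for _, mac in members) > min_mac:
            kept.append((lo, hi, [pos for pos, _ in members]))
    return kept


# Hypothetical SNV positions and minor allele counts on a 12kb region.
snvs = [(500, 15), (2500, 10), (4500, 3), (9000, 30)]
tested = windows_to_test(snvs, 0, 12000)
```

Because windows overlap, a single SNV can contribute to two adjacent tests, as position 9000 does here.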

To assess type I error (α), we run 1,000 simulations with 1,000 subjects whose trait is generated from a standard normal distribution, $y_i \overset{iid}{\sim} N(0, 1)$. In each simulation, we test 20,000 windows, using 500 subjects for optimizing the cSKAT weights (N0 = 500) and the other 500 subjects for testing at level α (N1 = 500). Because the weights are optimized on a subset of individuals who are independent from individuals used for hypothesis testing, p-values are computed from the SKAT null distribution, $Q \overset{H_0}{\sim} \sum_{j=1}^{J} \lambda_j \chi^2_1$, where $\lambda_j$ are the eigenvalues of $(1/\hat{\phi}_0) P^{1/2} \left(\sum_{l=1}^{L} \gamma_l K_l\right) P^{1/2}$, instead of the permutation distribution.

To compare statistical power of the test statistics, we run 100 simulations using 10,000 subjects in 54 windows (of length 4kb) for different trait-generating models. The windows selected for power simulations satisfy several criteria:

  1. Over half of the SNVs have 2 or fewer non-zero annotations

  2. All annotations are present and vary across the window

  3. Number of SNVs ≥ 5

  4. At least one SNV has unique annotation (i.e. ≥ 1 SNV with OC-only, 1 SNV with TF-only, or ≥ 1 SNV with HM-only)

The first condition implies some degree of orthogonality between annotation classes. In windows with highly correlated annotations, estimated weights are unstable and difficult to interpret. The other criteria ensure a diverse set of weights and causal SNVs are included in simulations. All annotation classes must be present in a window to simulate equal class weights. When 20% of SNVs are causal, at least 5 SNVs are needed for one causal SNV. The unique annotation condition ensures less abundant annotations are well-represented in the simulations and do not always coincide with more abundant annotations.

We evaluate the power of the cSKAT statistic given various sample sizes for estimation and testing. Power for SKAT and SKAT-O is evaluated on the full sample in each simulation. Let $\tilde{w}_{kl}$ be the sum of all annotations of class $l$ for SNV $k$, normalized to the unit interval. For each window and each simulation setting of $\gamma$, we select 20% of SNVs as causal based on annotation, with $P(\text{SNV } k \text{ is causal}) = \sum_{l=1}^{4} \gamma_l \tilde{w}_{kl} \big/ \sum_{k=1}^{m} \sum_{l=1}^{4} \gamma_l \tilde{w}_{kl}$. We simulate a continuous trait for each simulation with a simple linear model:

$$y = \sum_{k=1}^{\tilde{m}} \beta_k g_k + \varepsilon \tag{6}$$

where $\tilde{m}$ is the number of causal variants in the window, $\beta_k$ is the effect of the $k$th causal SNV, specified as $\beta_k = \sum_{l=1}^{4} \gamma_l \tilde{w}_{kl}$, and the random error is $\varepsilon \sim N(0, \sigma_e^2)$, where $\sigma_e^2$ is fixed so that SNVs explain 1% of the trait variance, $R^2_{window} = 1\%$. Note that cSKAT weights $\hat{\gamma}$ are estimated at the kernel level and differ from the simulation-model weights $\gamma$, which are on the scale of the original data.
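The ingredients of the trait-generating model in Equation (6) can be sketched as below: causal-selection probabilities, effect sizes, and an error variance that yields a target R². The annotation values are hypothetical, and solving Var(genetic) / (Var(genetic) + σe²) = R² for σe² is an assumption about how the 1% target was imposed.

```python
def causal_probabilities(w_tilde, gamma):
    """P(SNV k is causal), proportional to sum_l gamma_l * w_tilde[k][l]."""
    scores = [sum(g * w for g, w in zip(gamma, row)) for row in w_tilde]
    total = sum(scores)
    return [s / total for s in scores]


def effect_size(w_tilde_k, gamma):
    """beta_k = sum_l gamma_l * w_tilde[k][l] for a causal SNV k."""
    return sum(g * w for g, w in zip(gamma, w_tilde_k))


def error_variance(genetic_values, r2=0.01):
    """sigma_e^2 such that Var(genetic) / (Var(genetic) + sigma_e^2) equals r2."""
    n = len(genetic_values)
    mean = sum(genetic_values) / n
    var_g = sum((v - mean) ** 2 for v in genetic_values) / n
    return var_g * (1.0 - r2) / r2


# Hypothetical normalized annotations per SNV, in the order (OC, TF, HM, MAF).
w_tilde = [[1.0, 0.0, 0.2, 0.5],
           [0.0, 1.0, 0.0, 0.1],
           [0.5, 0.5, 1.0, 1.0],
           [0.0, 0.0, 0.7, 0.3]]
gamma = [0.5, 0.5, 0.0, 0.0]           # second scenario: equal weight on OC and TF
probs = causal_probabilities(w_tilde, gamma)
beta_3 = effect_size(w_tilde[2], gamma)
sigma2_e = error_variance([0.0, 1.0, 0.0, 1.0])   # toy genetic values
```

An SNV with no signal in the weighted classes, such as the fourth SNV here, has zero probability of being selected as causal.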

We also evaluate the robustness of our approach to partial and complete misspecification of SNV weights. Let $\gamma_{missp}$ be the degree of misspecification, ranging from 0 to 1. We select $\gamma_{missp} \times \tilde{m}$ causal variants randomly (without regard for annotations) and assign them random uniform effect sizes $\beta_{missp} \sim U(0, 1)$. The remaining variants are selected and weighted for annotation exactly as in Equation (6). In these simulations, we define partial misspecification as $\gamma_{missp} = 0.5$ and complete misspecification as $\gamma_{missp} = 1$.

SNV annotations wk are fixed (see Table 2), while annotation class weights γ are varied according to Table 4. In the first scenario, all annotation classes have equal weight: 0.25 to open chromatin (OC), 0.25 to transcription factors (TF), 0.25 to histone modification (HM), and 0.25 to a function of MAF. We assign equal weight to OC and TF in the second scenario (γOC = γTF = 0.5), assign all weight to TF in the third scenario (γTF = 1), and assign all weight to a function of MAF in the last scenario (γMAF = 1).

Table 4:

Power Simulation Parameters

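| $R^2_{window}$ | γOC | γTF | γHM | γMAF | γmissp |
| --- | --- | --- | --- | --- | --- |
| 1% | 0.25 | 0.25 | 0.25 | 0.25 | 0 |
| 1% | 0.5 | 0.5 | 0 | 0 | 0 |
| 1% | 0 | 1 | 0 | 0 | 0 |
| 1% | 0 | 0 | 0 | 1 | 0 |
| 1% | 0 | 0 | 0 | 0 | 1 |
| 1% | 0 | 0.5 | 0 | 0 | 0.5 |
| 1% | 0 | 1 | 0 | 0 | 0 |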

To determine the sample size used for estimating weights in power simulations, we compare estimated weights (averaged over 54 windows and 100 simulations per window) across multiple sample sizes (N0 = 100, 200, …, 2000). Let $\hat{\gamma}_{l,n}$ be the average estimated weight for annotation $l$ at sample size $n$, and denote the (absolute) difference between weights estimated at consecutive sample sizes by $\delta_n = \sum_{l=1}^{4} |\hat{\gamma}_{l,n} - \hat{\gamma}_{l,n-200}|$. We compare cSKAT power with two different estimation sample sizes, N0, based on the criteria $\delta_n \le 0.1$ and $\delta_n \le 0.05$.
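The convergence criterion on δn can be sketched as a small search over sample sizes. The averaged weights below are hypothetical, and the lag between compared sample sizes is left as a parameter (defaulting to 200, following the subscript in the definition of δn).

```python
def weight_change(gamma_by_n, n, step=200):
    """delta_n: summed absolute change in class weights between n - step and n."""
    return sum(abs(a - b) for a, b in zip(gamma_by_n[n], gamma_by_n[n - step]))


def first_stable_n(gamma_by_n, tol, step=200):
    """Smallest sample size whose delta_n is at most tol, or None if none qualifies."""
    for n in sorted(gamma_by_n):
        if n - step in gamma_by_n and weight_change(gamma_by_n, n, step) <= tol:
            return n
    return None


# Hypothetical average weights (OC, TF, HM, MAF) at increasing sample sizes.
gamma_by_n = {
    200: [0.40, 0.30, 0.20, 0.10],
    400: [0.30, 0.40, 0.20, 0.10],
    600: [0.27, 0.43, 0.20, 0.10],
    800: [0.26, 0.44, 0.20, 0.10],
}
n_loose = first_stable_n(gamma_by_n, tol=0.1)
n_strict = first_stable_n(gamma_by_n, tol=0.05)
```

A stricter tolerance selects a larger estimation subset, which leaves fewer subjects for testing; this is the trade-off examined in the power simulations.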

2.5. Analysis in the Framingham Heart Study

We applied our method to data from the Framingham Heart Study (FHS), an ongoing longitudinal cohort study with detailed medical history, physical examinations, and medical tests (Dawber et al., 1951). The first 5209 FHS participants, called the “Original Cohort”, were recruited in 1948. In 1971, a second cohort (“Offspring”) of 5124 participants was recruited from offspring of the Original Cohort and their spouses (Kannel et al., 1979). Finally, the Third Generation Cohort (“Gen III”) consists of 4095 grandchildren of the Original Cohort and children of Offspring Cohort spouses whose parents were not in the Original Cohort (Splansky et al., 2007). While originally developed as a cardiovascular cohort study, the FHS includes many other traits, such as fasting glucose and various cancers. In our analysis, we tested associations between ≥8-hour fasting glucose and SNVs in genes or intergenic (4kb) windows.

We used genetic and trait data for 6419 diabetes-free participants from the Offspring Cohort at exam 5 and Third Generation Cohort at exam 1. Fasting glucose residuals were computed within each sex and cohort by regressing fasting glucose on age and age squared.

We constructed weighted linear kernels from each annotation class in Table 2. All features within each class were summed and resulting SNV weights were normalized to the unit interval. When applying our method, we estimated convex weights in an unrelated subset of individuals (n=1814) and used the remaining individuals (n=4605) to test the association between fasting glucose and SNVs within genes and intergenic windows. A modified SKAT statistic, famSKAT (Chen et al., 2013), was used in the association test to account for relatedness between FHS participants. We also performed SKAT-O in the full set of individuals (n=6419). All analyses were run in R version 3.4.3 (R Core Team, 2019) with the seqMeta package (Voorman et al., 2013), which implements the famSKAT method to account for relatedness between participants.

3. Results

3.1. Simulation Results

Using data-splitting, type I error (α) of cSKAT was controlled at all levels but 0.05 and was slightly conservative at type I error levels below 0.005 (see Table 5). Inflation at α = 0.05 may be due to small sample size. The null distribution should be evaluated through permutation testing when possible to correct for this departure from the nominal significance level.

Table 5:

Type I Error for cSKAT

| α | Observed Type I Error | 95% CI |
| --- | --- | --- |
| 0.05 | 0.05163 | (0.05153, 0.05173) |
| 0.005 | 0.00501 | (0.00498, 0.00504) |
| 0.001 | 0.00097 | (0.00096, 0.00098) |
| 5 × 10⁻⁴ | 4.9 × 10⁻⁴ | (4.8 × 10⁻⁴, 5.0 × 10⁻⁴) |
| 1 × 10⁻⁴ | 9.2 × 10⁻⁵ | (8.8 × 10⁻⁵, 9.7 × 10⁻⁵) |
| 1 × 10⁻⁵ | 9.6 × 10⁻⁶ | (8.3 × 10⁻⁶, 1.1 × 10⁻⁵) |
| 2.5 × 10⁻⁶ | 2.4 × 10⁻⁶ | (1.7 × 10⁻⁶, 3.1 × 10⁻⁶) |
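The text does not state how the confidence intervals in Table 5 were computed. One plausible reading, sketched here as an assumption, is a normal-approximation (Wald) interval for a proportion over all 1,000 × 20,000 tests, which reproduces the first row of the table to the reported precision.

```python
import math


def wald_ci(p_hat, n_tests, z=1.959964):
    """Normal-approximation 95% confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n_tests)
    return p_hat - z * se, p_hat + z * se


# Assumption: every window in every simulation contributes one test.
n_tests = 1000 * 20000
lo, hi = wald_ci(0.05163, n_tests)
```

With twenty million tests the interval at α = 0.05 is narrow enough that even a small excess over the nominal level is detectable, which is consistent with the slight inflation reported at that level.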

Figure 1 is a plot of estimated cSKAT weights at different sample sizes for each simulation scenario. Note that cSKAT weights γ^ are estimated from the variance component model used for SKAT which differs from the simulation model, and consequently γ^ do not converge to γ. In all simulation scenarios, estimated cSKAT weights γ^ converged within 1000 samples for criteria δn ≤ 0.05 and within 600 samples for criteria δn ≤ 0.1.

Figure 1:

Annotation weights (γ^) estimated by cSKAT at sample sizes N0 = 100, 200, …, 2000. We set 20% of SNVs to be causal. Causal SNVs explain 1% of trait variance (R2 = 1%). In the top panel, all annotation types contributed equally to the effect size of causal SNVs, i.e. the effect size for a causal SNV is the mean of all (standardized) annotations for the SNV. In the middle panel, causal SNV effect size is a mean of OC and TF (but no other) annotations. In the bottom panel, effect size is simply the standardized TF annotations.

Figure 2 displays empirical power for cSKAT, SKAT, and SKAT-O computed at α = 10⁻⁸, averaged over the 54 windows and 100 simulations per window. Power was evaluated for sample sizes N=200 to 2000 (by 100), N=2000 to 5000 (by 500), and N=5000 to 10000 (by 1000) with estimation subset N0 withheld from the cSKAT test. For most sample sizes (N ≤ 8000), cSKAT had greater power for the smaller estimation subset (N0 = 600) than the larger estimation subset (N0 = 1000), indicating a preference for test sample size (N1) over optimality of weights γ^. Under all simulated scenarios, cSKAT with N0 = 600 had greater power than SKAT and SKAT-O in moderately large samples (N ≥ 5000). For smaller samples (N ≤ 4000), cSKAT was less powerful than SKAT and SKAT-O due to sample loss from data splitting. Power for cSKAT improved when annotation weights were more concentrated, with up to 15% higher power than SKAT-O when transcription factors had a weight of 1.

Figure 2:

Power comparison of cSKAT (proposed), SKAT, and SKAT-O. Empirical power is computed at α = 1 × 10⁻⁸ and averaged over 54 windows, 100 simulations per window. We set 20% of SNVs to be causal. Causal SNVs explain 1% of trait variance (R2 = 1%). In the top panel, all annotation types contributed equally to the effect size of causal SNVs, i.e. the effect size for a causal SNV is the mean of all (standardized) annotations for the SNV. In the middle panel, causal SNV effect size is a mean of OC and TF (but no other) annotations. In the bottom panel, effect size is simply the standardized TF annotation.

Figure 3 is a comparison of statistical power for different methods (see Table 3). Power was computed at α = 10⁻⁸, averaged over the 54 windows and 100 simulations per window. Power was evaluated for sample sizes N=3000, 5000, and 7000. Our proposed cSKAT approach was more powerful than the unweighted combination SKAT and had comparable power to ACAT in large samples (N > 7000). In moderately large samples (N > 5000), cSKAT was more powerful than unweighted combination SKAT when the only causal annotation was MAF-based. For smaller samples (N < 5000), cSKAT was less powerful than unweighted combination SKAT and ACAT, potentially due to reduced sample size from sample-splitting.

Figure 3:

Statistical power of ACAT, cSKAT (proposed), cSKAT restricted to two annotations (OC, MAF), unweighted combination SKAT, SKAT with ideal weights, and SKAT and SKAT-O with the standard MAF weights derived from a beta distribution. Empirical power is computed at α = 1 × 10⁻⁸ for different sample sizes (N=3000, 5000, 7000) and averaged over 54 windows, 100 simulations per window. A total of 600 samples were used to optimize cSKAT weights, with the remaining samples used for the association test (N=2400, 4400, or 6400, respectively). We set 20% of SNVs to be causal. Causal SNVs explain 1% of trait variance (R2 = 1%). In the left-most panel, all annotation types contributed equally to the effect size of causal SNVs, i.e. the effect size for a causal SNV is the mean of all (standardized) annotations for the SNV. In the panel second from the left, causal SNV effect size is a mean of OC and TF (but no other) annotations. In the third panel, effect size is simply the standardized TF annotation. In the right-most panel, SNV effect size is the Beta(1,25) density evaluated at the SNV MAF.

The ACAT approach, which is a combination of p-values from SKAT tests for each annotation, was more powerful than cSKAT and SKAT tests in most scenarios, almost reaching the power of SKAT with ideal weights (an upper bound on statistical power). In the scenario where the only causal annotation was MAF-based, however, the standard SKAT and SKAT-O with only MAF-based annotation had greater power than ACAT and cSKAT.

3.2. Results in FHS

Our cSKAT test had a low genomic inflation factor (λGC = 1.037) comparable to the SKAT-O test (λGC = 1.039). The Q-Q plots (see Figure 4) indicate the estimation and test subsets were sufficiently independent for cSKAT.

Figure 4:

Q-Q plots for (a) cSKAT (proposed) and (b) SKAT-O in the rare variant analysis on fasting glucose. Minus log base 10 of p-values are plotted. Genomic control factor λGC is the ratio of the median observed $\chi^2_1$ (converted from the median p-value) to the median expected $\chi^2_1$. The 45-degree line is λGC = 1, with shaded 95% CI.
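The genomic control factor defined in this caption can be sketched with the standard library alone, using the fact that the upper-tail quantile of a chi-square with one degree of freedom equals the square of a standard normal quantile. The p-values below are hypothetical and the function name is an assumption.

```python
from statistics import NormalDist, median


def genomic_inflation(pvalues):
    """lambda_GC: chi2_1 statistic at the median p-value over the chi2_1 median."""
    std_normal = NormalDist()
    # Upper-tail chi2_1 quantile at p equals (Phi^{-1}(1 - p/2))^2.
    observed = std_normal.inv_cdf(1.0 - median(pvalues) / 2.0) ** 2
    expected = std_normal.inv_cdf(0.75) ** 2      # median of chi2_1, about 0.455
    return observed / expected


lambda_null = genomic_inflation([0.1, 0.5, 0.9])       # median p-value of 0.5
lambda_inflated = genomic_inflation([0.1, 0.4, 0.9])   # median p-value below 0.5
```

A value near 1 indicates that the bulk of the test statistics follows the null distribution; values above 1 indicate inflation.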

Due to small sample size in FHS (test subset n=4605), we found no genome-wide significant associations (p < 10⁻⁷) between fasting glucose and the tested regions (see Figure 5). However, two of the top cSKAT associations had potential biological connections to fasting glucose and were undetected by SKAT-O (see Table 6). The strongest association was in chromosome 2 for a region within 20kb of ROCK2 (cSKAT p = 2.11 × 10⁻⁵, SKAT-O p = 0.10), which has been shown to induce obesity-mediated insulin resistance and cardiac dysfunction (Soliman et al., 2015). In this region near ROCK2, the estimated annotation weights were 1 for transcription factors and 0 for all other annotation classes, suggesting the region may have a regulatory effect on ROCK2. The second highest association was found in the gene CPLX1 (cSKAT p = 5.26 × 10⁻⁵, SKAT-O p = 0.39), which has previously been implicated in glucose-induced secretion of insulin by pancreatic beta-cells (Abderrahmani et al., 2004). The estimated annotation weights in CPLX1 were large for histone modification (0.296) and minor allele frequency (0.642). The other top associations had no biological connection to fasting glucose.

Figure 5:

Manhattan plots of the rare-variant association study in FHS using our proposed cSKAT approach. The solid line is the genome-wide significance level (α = 1.47 × 10⁻⁷) and the dotted line is a suggestive threshold (α = 10⁻⁴).

Table 6:

Top cSKAT Associations

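| Chr | Mid-bp | Nearest Gene | Distance from Gene | cSKAT p-value | SKAT-O p-value | nSNVs | γOC | γTF | γHM | γMAF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 11506210 | ROCK2 | 22 kb | 2.1 × 10⁻⁵ | 0.10 | 7 | 0 | 1 | 0 | 0 |
| 1 | 804928 | CPLX1 | 0 | 5.3 × 10⁻⁵ | 0.39 | 11 | 0.021 | 0.041 | 0.296 | 0.642 |
| 1 | 222069979 | LOC101929771 | 56 kb | 6.1 × 10⁻⁵ | 0.74 | 16 | 0 | 0.316 | 0.440 | 0.244 |
| 3 | 50476518 | CACNA2D2 | 0 | 6.5 × 10⁻⁵ | 0.03 | 312 | 0 | 0 | 1 | 0 |
| 4 | 54857928 | RPL21P44 | 5 kb | 9.7 × 10⁻⁵ | 0.34 | 10 | 0.035 | 0.458 | 0 | 0.506 |
| 8 | 41910691 | KAT6A | 1 kb | 9.9 × 10⁻⁵ | 0.08 | 6 | 0 | 1 | 0 | 0 |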

γOC = estimated weight for open chromatin annotation (0 ≤ γ ≤ 1)

Table 7 lists highly annotated SNVs (with cSKAT weight > 40%) in the top two associated windows. Annotations in Table 7 are presented in their original scale (bounded between 0 and 1000). The cSKAT-estimated SNV weights ($w_{cSKAT} = \sum_{l=1}^{4} \hat{\gamma}_l w_l$) have been rescaled so all SNV weights sum to 1, and represent the proportion of trait variance explained by the window that is attributable to the SNV. In the window near ROCK2, our method attributed 96% of SNV weight to two SNVs enriched for transcription factors and open chromatin. In CPLX1, 41% of trait variance explained by the gene was attributed to one SNV located at a histone modification. A comparison of cSKAT and SKAT-O SNV weights is provided in Appendix A.2.
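The rescaling of cSKAT SNV weights described above can be sketched as follows. The annotation values and estimated class weights are made up for illustration and are not the values behind Table 7; the function name is an assumption.

```python
def cskat_snv_weights(annotations, gamma_hat):
    """Composite SNV weights sum_l gamma_hat_l * w_l, rescaled to sum to 1."""
    raw = [sum(g * a for g, a in zip(gamma_hat, row)) for row in annotations]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical normalized annotations per SNV, in the order (OC, TF, HM, MAF).
annotations = [[0.3, 1.0, 0.1, 0.2],
               [0.9, 1.0, 0.0, 0.4],
               [0.5, 0.5, 0.2, 0.9]]
gamma_hat = [0.0, 1.0, 0.0, 0.0]     # all class weight on transcription factors
snv_weights = cskat_snv_weights(annotations, gamma_hat)
```

When one class receives all of the estimated weight, SNVs with identical values for that class receive identical composite weights, whatever their other annotations.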

Table 7:

SNVs with cSKAT Weight > 40%

Chr  bp        Gene   Major/Minor  MAF     wcSKAT  OC   TF: CEBP-β  TF: FOXA2  TF: HNF4α  TF: P300  HM: H3K9Ac
2    11506578  ROCK2  T/C          0.0017  0.48    360  968         752        959        1000      128
2    11506743         C/A          0.0022
4    818583    CPLX1  T/A          0.0017  0.41    229  0           0          0          0         922

wcSKAT = estimated weight for SNV (0 ≤ w ≤ 1). Proportion of total SNV weight in the gene attributable to the SNV. The two SNVs in ROCK2 have identical weights because they have the same TF annotation value, and TF had 100% weight in ROCK2 (i.e. wcSKAT = TF value)

Annotations range from 0 (no signal) to 1000 (max signal)

We also compared SKAT and cSKAT weights in G6PC2, a gene with known rare variant associations with fasting glucose (Wessel et al., 2015; Mahajan et al., 2015). Four likely candidates were found to be driving the joint association between fasting glucose and rare variants in G6PC2: rs138726309, rs2232323, rs146779637, and rs2232326. In our FHS analysis, we found all four variants had greater cSKAT weight than the standard SKAT weight derived from MAF and a Beta(1,25) distribution. In particular, the cSKAT weight for rs2232326 (MAF = 0.0016) was three-fold higher than the SKAT weight (wcSKAT = 0.065 vs wSKAT = 0.02).
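For readers unfamiliar with the standard SKAT weighting, the sketch below evaluates a Beta(1, 25) density at each variant's MAF and rescales the results to sum to 1. The MAF values are hypothetical, and the sum-to-one rescaling is an assumption made here for comparability with the rescaled cSKAT weights; it is not a reproduction of the FHS weights.

```python
import math


def beta_density(x, a=1.0, b=25.0):
    """Beta(a, b) probability density evaluated at x in [0, 1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1.0) * (1.0 - x) ** (b - 1.0)


def skat_weights(mafs, a=1.0, b=25.0):
    """Beta-density SNV weights, rescaled to sum to 1 (rescaling is an assumption)."""
    raw = [beta_density(m, a, b) for m in mafs]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical minor allele frequencies for four variants in one set.
mafs = [0.0016, 0.005, 0.02, 0.05]
weights = skat_weights(mafs)
for maf, w in zip(mafs, weights):
    print(f"MAF={maf:.4f}  weight={w:.3f}")
```

With parameters (1, 25) the density reduces to 25(1 − MAF)^24, so weights decrease smoothly with MAF and differ only modestly among very rare variants.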

The computational burden of the optimization step of cSKAT is small relative to the burden of computing SKAT statistics and p-values. In our real data application, weight optimization was completed in 34.4 CPU hours, while association testing required 337.1 CPU hours for 340,136 genes and 4kb sliding windows including a total of 5,442,193 rare SNPs (MAF < 0.05).

4. Discussion

In this paper, we present a novel method, cSKAT, for optimizing the rare variant SKAT statistic over multiple potentially relevant SNV annotations. The method has higher power than SKAT and SKAT-O in large cohorts (N ≥ 5000) when SNV weights are not completely misspecified, and provides interpretable SNV weights that can inform biological functional studies.

In FHS, we find a possible association between fasting glucose and rare variants near ROCK2 (p = 2.1 × 10−5) and within CPLX1 (p = 5.3 × 10−5), genes involved in obesity-mediated insulin resistance (Soliman et al., 2015) and glucose-induced insulin secretion by pancreatic beta-cells (Abderrahmani et al., 2004), respectively. In the window near ROCK2, our method assigns 96% of SNV weight to two SNVs at an active transcription factor (TF) binding site. At these highly weighted loci, the strongest TF signal is P300, which interacts directly with ROCK2. ROCK2 regulates the acetyltransferase activity of P300 through phosphorylation (Tanaka et al., 2006). The second largest TF signals, CEBP-β and HNF4-α, are only indirectly related to ROCK2, e.g. ROCK2 knockdown has been shown to increase gene expression of CEBPD (Li et al., 2015), which forms heterodimers with CEBP-α. ROCK1, which often shares functions with ROCK2, interacts with factor HNF4-α (Yoshikawa et al., 2015). There is no known link between ROCK2 and the other active transcription factor at this site, FOXA2.

In CPLX1, 41% of trait variance explained by the gene was attributed to one SNV with high values of histone modification H3K9Ac. H3K9Ac serves an important role in transcription and its loss or depletion in promoters can reduce gene expression. In a recent study, investigators hypothesized that H3K9Ac recruits proteins downstream of transcription initiation which are needed for the next step of transcription (Gates et al., 2017). Replication is required to validate these results in other cohorts (using the estimated annotation weights from FHS).

Our method has several limitations. Data-splitting allows us to compute p-values efficiently from the SKAT null distribution but reduces power because samples are excluded from testing. Sample splitting may also result in splits where variants present in one split are unobserved in another split due to low frequency. To address this limitation, we have optimized kernels based on annotations rather than optimizing single-variant weights. If an annotation has a strong biological connection to the trait, we would expect stronger association between annotated SNVs (genome-wide) and the trait. For example, Purcell et al. (2014) show genome-wide sets of annotated variants (e.g. indel and frameshift variants) are enriched for associations with schizophrenia. Meta-analysis can mitigate loss of power due to sample loss, and multiple sample splitting can ensure all SNVs are involved in kernel optimization (Meinshausen et al., 2009). Further simulations will be required to evaluate other kernel optimization schemes for meta-analysis and find the number of sample splits that balances the increase in power against the increase in computation time.

We also optimize a standardized test statistic rather than a p-value. The distribution of SKAT statistics under different kernels is complex and rescaling may not be adequate in some cases. We restrict our application to linear kernels with this limitation in mind. For more complex kernels, we recommend minimizing p-values rather than maximizing a standardized statistic.

Another limitation of our method is a priori selection of annotation. All rare variant tests require a priori specification of annotation, but our method is especially sensitive to choice of annotation. When annotations are too correlated, the optimization problem is not strictly convex and its solution may not be unique (Cortes et al., 2012). Annotations may also be too sparse or completely absent from many variant sets. In both cases, the optimized kernel weights γ would be difficult to interpret. Here, we sidestep the issue by aggregating annotations within biological classes and using only annotations related to fasting metabolism. For more extensive annotations that are highly correlated, we suggest creating orthogonal annotation classes with principal component analysis (Jolliffe and Cadima, 2016). In regions where annotations are too sparse for cSKAT, we instead recommend the standard SKAT approach followed by post-hoc variable selection to prioritize individual rare variants, such as Kernel Iterative Feature Extraction (KNIFE) (He et al., 2016).

We included only four annotation sources in this paper based on their well-documented involvement in regulating fasting metabolism (Goldstein and Hager, 2015), but cSKAT can easily accommodate additional annotations. For example, several schizophrenia studies have shown rare disruptive variants (nonsense, essential splice site or frameshift) substantially increase risk for schizophrenia (Purcell et al., 2014; Singh et al., 2017; Teng et al., 2018). In these studies, separate analyses were conducted for disruptive variants and other rare variant sets. Using cSKAT, the separate analyses could be combined by pooling all SNVs and coding set memberships as binary annotations (0 for exclusion, 1 for inclusion). The optimized cSKAT weights would then enable direct comparisons between disruptive variants and other variant classes. Given rapidly growing and publicly available functional genomic annotations, adaptive annotation weighting is now an invaluable tool for pinpointing the biological mechanisms driving associations between rare variants and complex traits.
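The binary coding of set membership mentioned above could be sketched as follows. The consequence labels, set definitions, and function name are hypothetical and serve only to illustrate the 0/1 coding; they are not taken from the cited studies.

```python
def binary_annotations(variant_classes, sets):
    """Code set membership as binary annotations (1 = included, 0 = excluded).

    variant_classes: list with one consequence label per pooled SNV.
    sets: dict mapping a set name to the labels that belong to it.
    Returns a dict mapping each set name to a 0/1 vector over the pooled SNVs.
    """
    return {
        name: [1 if cls in members else 0 for cls in variant_classes]
        for name, members in sets.items()
    }


# Hypothetical pooled SNVs, each labeled with a predicted consequence.
variant_classes = ["nonsense", "missense", "frameshift", "synonymous", "splice_site"]
sets = {
    "disruptive": {"nonsense", "splice_site", "frameshift"},
    "other": {"missense", "synonymous"},
}
annot = binary_annotations(variant_classes, sets)
print(annot["disruptive"])  # [1, 0, 1, 0, 1]
print(annot["other"])       # [0, 1, 0, 1, 0]
```

Each binary vector would then play the role of one annotation, so the estimated weight for each set can be compared directly with the others.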

5. Acknowledgments

This work was partially supported by NIH grant U01 DK078616. The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI. Funding for SHARe Affymetrix genotyping was provided by NHLBI Contract N02-HL64278. SHARe Illumina genotyping was provided under an agreement between Illumina and Boston University.

A. Appendix

A.1. cSKAT Optimization

Existing rare variant tests with adaptive annotation selection use the minimum p-value over annotations. Two such tests have been developed for SKAT (Urrutia et al., 2016) and SKAT-O (He et al., 2017), where the test statistic is the minimum p-value among SKAT(-O) statistics computed for each annotation. Denote $p_{Q_l}$ as the p-value for the SKAT(-O) statistic using the $l$th annotation. The minP statistic is:

$$T = \min\{p_{Q_1}, p_{Q_2}, \ldots, p_{Q_L}\}. \tag{7}$$

The significance of the T statistic can be evaluated analytically. The minimum p-value approach performs well but scales poorly to combinations of annotations, where p-values must be computed for a grid of combination weights λ and numerical integration is required to compute the p-value of T. On the other hand, maximizing combinations of test statistics without accounting for their p-values results in poor power (Zhan et al., 2017a,b). In our application, adding SNVs to an SNV-set would increase the test statistic but also increase the eigenvalues of the kernel matrix. Intuitively, rescaling kernel test statistics by their kernel matrix eigenvalues could help connect test statistic maximization to p-value minimization. For example, the null distribution of a SKAT statistic rescaled by its eigenvalues can be approximated through the Satterthwaite approach (Lumley, 2011):

$$\frac{Q}{\|\lambda_\gamma\|_2} \overset{\text{approx}}{\sim} a\chi^2_v \tag{8}$$

where scale parameter $a = \|\lambda_\gamma\|_2 / \|\lambda_\gamma\|_1$ and degrees of freedom $v = (\|\lambda_\gamma\|_1 / \|\lambda_\gamma\|_2)^2$ are ratios of the $\ell_1$ and $\ell_2$ norms of the kernel matrix eigenvalues $\lambda_\gamma$. Thus, increasing eigenvalues will increase the scale but decrease the degrees of freedom. While we optimize a standardized test statistic, $Q/\|\lambda_\gamma\|_2$, the distribution of SKAT statistics under different kernels is complex and rescaling may not be adequate in some cases. We restrict our application to linear kernels with this limitation in mind. For more complex kernels, we recommend minimizing p-values rather than maximizing a standardized statistic.
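As a hedged illustration of the approximation in (8), the Python sketch below computes the Satterthwaite scale and degrees of freedom from a vector of kernel eigenvalues. The function name and the toy eigenvalues are assumptions made for illustration only.

```python
import math


def satterthwaite_params(eigenvalues):
    """Scale a and degrees of freedom v for Q / ||lambda||_2 ~ a * chi2_v.

    a = ||lambda||_2 / ||lambda||_1 and v = (||lambda||_1 / ||lambda||_2)^2.
    """
    l1 = sum(abs(x) for x in eigenvalues)
    l2 = math.sqrt(sum(x * x for x in eigenvalues))
    return l2 / l1, (l1 / l2) ** 2


# Toy example: four equal eigenvalues (hypothetical).
a, v = satterthwaite_params([1.0, 1.0, 1.0, 1.0])
print(a, v)  # 0.5 4.0
```

In this toy case the standardized statistic is a sum of four independent chi-square(1) variables divided by 2, which is exactly 0.5 times a chi-square with 4 degrees of freedom, so the approximation is exact. More generally, the product a·v equals the ratio of the l1 to the l2 norm and 2a²v equals 2, so the approximation matches the null mean and variance of the standardized statistic.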

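To incorporate multiple sources of annotation, we optimize a convex combination of SKAT statistics, $Q_\gamma = \sum_{l=1}^{L} \gamma_l Q_l$, where the eigenvalues $\lambda$ of test statistic $Q_\gamma$ depend on the convex weights $\gamma$. We show that maximizing $Q/\|\lambda\|_2$ is equivalent to maximizing the centered alignment $A$ between trait $y$ and convex combination kernel $K_\gamma$ for a continuous trait with no non-genetic covariates. For a continuous trait, the cSKAT statistic can be rewritten as $Q_\gamma = \hat{\sigma}_0^{-2} y^T P K_\gamma P y$, where the projection matrix is $P = I_n - X(X^T X)^{-1} X^T$. When there are no non-genetic covariates, $P$ is simply a centering matrix ($I_n - \mathbf{1}_n \mathbf{1}_n^T / n$) and, assuming all candidate kernels are centered, the cSKAT statistic reduces to $Q_\gamma = \hat{\sigma}_0^{-2} y^T K_\gamma y$. Let $\langle \cdot, \cdot \rangle_F$ and $\|\cdot\|_F$ denote the Frobenius inner product and norm, and let the centered kernel-target alignment be $A(yy^T, K_\gamma) = \frac{\langle yy^T, \sum_{l=1}^{L} \gamma_l K_l \rangle_F}{\|yy^T\|_F \, \|K_\gamma\|_F}$. Then observe:

$$\frac{Q_\gamma}{\|\lambda\|_2} \propto \frac{y^T K_\gamma y}{\sqrt{\sum_{j=1}^{J} \lambda_j^2}} = \frac{\operatorname{tr}\left(yy^T \sum_{l=1}^{L} \gamma_l K_l\right)}{\sqrt{\operatorname{tr}(K_\gamma^2)}} \propto \frac{\langle yy^T, \sum_{l=1}^{L} \gamma_l K_l \rangle_F}{\|yy^T\|_F \, \|K_\gamma\|_F} = A(yy^T, K_\gamma). \tag{9}$$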

Hence, maximizing the cSKAT statistic scaled by its eigenvalues is equivalent to maximizing the centered kernel target alignment between trait and convex combination kernel:

$$\arg\max_{\gamma} \frac{Q_\gamma}{\|\lambda\|_2} = \arg\max_{\gamma} A(yy^T, K_\gamma). \tag{10}$$
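To make the alignment criterion concrete, the following dependency-free Python sketch evaluates the kernel-target alignment A(yy^T, K_γ) for a few candidate weight vectors. The trait vector, the two candidate kernels, and all function names are hypothetical, and the sketch assumes a centered trait and centered kernels with no covariates.

```python
import math


def frobenius(A, B):
    """Frobenius inner product <A, B>_F = sum_ij A_ij * B_ij."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def outer(y):
    """Outer product y y^T as a list of rows."""
    return [[yi * yj for yj in y] for yi in y]


def combine(gamma, kernels):
    """Convex combination kernel K_gamma = sum_l gamma_l * K_l."""
    n = len(kernels[0])
    return [
        [sum(g * K[i][j] for g, K in zip(gamma, kernels)) for j in range(n)]
        for i in range(n)
    ]


def alignment(y, K):
    """Kernel-target alignment A(yy^T, K) for a centered trait y and centered K."""
    Y = outer(y)
    return frobenius(Y, K) / (math.sqrt(frobenius(Y, Y)) * math.sqrt(frobenius(K, K)))


def quadratic_form(y, K):
    """y^T K y, which equals <yy^T, K>_F."""
    n = len(y)
    return sum(y[i] * K[i][j] * y[j] for i in range(n) for j in range(n))


# Toy centered trait and two centered candidate kernels (hypothetical values).
y = [1.0, -1.0, 0.0]
K1 = outer(y)
K2 = outer([0.0, 1.0, -1.0])
for gamma in ([1.0, 0.0], [0.5, 0.5], [0.0, 1.0]):
    K_gamma = combine(gamma, [K1, K2])
    print(gamma, round(alignment(y, K_gamma), 4))
```

Because the quadratic form y^T K y equals the Frobenius inner product of yy^T with K, and the Frobenius norm of a symmetric kernel equals the l2 norm of its eigenvalues, ranking weight vectors by alignment gives the same ordering as ranking them by the standardized statistic.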

When there are non-genetic covariates, the optimal convex weights for a trait maximize the kernel-target alignment between the residuals of the trait regressed on non-genetic covariates ($e = y - \hat{y}_0$) and the centered convex combination kernel ($K_\gamma = \sum_{l=1}^{L} \gamma_l K_l$):

$$\gamma = \arg\max_{\gamma} \frac{\langle ee^T, K_\gamma \rangle_F}{\|K_\gamma\|_F}. \tag{11}$$

Let $a$ be the vector of inner products between residuals and centered candidate kernels, $a = (\langle ee^T, K_1 \rangle_F, \ldots, \langle ee^T, K_L \rangle_F)^T$, and let $M$ denote the matrix of inner products between candidate kernels, i.e. $M_{jk} = \langle K_j, K_k \rangle_F$. Then the optimal convex weights, $\gamma = v / \|v\|$, are obtained from the solution $v$ to the following Quadratic Programming (QP) problem:

$$\min_{v \ge 0} \; v^T M v - 2 v^T a. \tag{12}$$
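The QP in (12) can be sketched with a few lines of dependency-free Python. This is a sketch under stated assumptions rather than the authors' implementation: projected gradient descent is swapped in here as a simple solver (any non-negative QP solver would do), the final normalization divides by the sum of v so that the weights sum to one (the text writes ‖v‖ without naming the norm), and the residual vector and kernels are toy values.

```python
def frobenius(A, B):
    """Frobenius inner product <A, B>_F = sum_ij A_ij * B_ij."""
    return sum(x * y for row_a, row_b in zip(A, B) for x, y in zip(row_a, row_b))


def outer(e):
    """Outer product e e^T as a list of rows."""
    return [[ei * ej for ej in e] for ei in e]


def alignment_qp_weights(a, M, n_iter=5000):
    """Solve min_{v >= 0} v'Mv - 2 v'a by projected gradient descent.

    Returns v rescaled to sum to 1 (an assumed choice of normalization).
    """
    L = len(a)
    # For a positive semidefinite M, trace(M) bounds the largest eigenvalue,
    # so 1 / (2 * trace) is a safe step size for the gradient 2(Mv - a).
    step = 1.0 / (2.0 * sum(M[j][j] for j in range(L)))
    v = [0.0] * L
    for _ in range(n_iter):
        grad = [2.0 * (sum(M[j][k] * v[k] for k in range(L)) - a[j]) for j in range(L)]
        v = [max(0.0, v[j] - step * grad[j]) for j in range(L)]  # project onto v >= 0
    total = sum(v)
    return [x / total for x in v] if total > 0 else v


# Toy residuals and two centered candidate kernels (hypothetical values).
e = [1.0, -1.0, 0.0]
kernels = [outer(e), outer([0.0, 1.0, -1.0])]
E = outer(e)
a = [frobenius(E, K) for K in kernels]
M = [[frobenius(Kj, Kk) for Kk in kernels] for Kj in kernels]
gamma = alignment_qp_weights(a, M)
print([round(g, 6) for g in gamma])
```

In this toy problem the first kernel coincides with ee^T, so the solution places essentially all weight on it. Because a and M have only L entries and L × L entries respectively, the QP itself is small once the Frobenius inner products have been computed.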

A.2. Comparison of SNV weights used in FHS

In Figure 6, we compare cSKAT and SKAT weights for SNVs in the windows included in Table 6. Both cSKAT and SKAT weights were rescaled to [0, 1] to facilitate comparisons. SKAT weights were generally uniform, with differences between cSKAT and SKAT weights driven by extreme cSKAT weights. SNVs with large cSKAT weights generally had large annotation values and a strong association with the trait. In two of the six windows, for example, cSKAT assigned the majority of weight in the window to a few SNVs at a transcription factor binding site and histone modification, respectively (Table 7).

Figure 6:

Comparison of SNV weights used in FHS. The y-axis represents the standard SKAT weight for SNVs (beta density evaluated at MAF), while the x-axis represents SNV weights estimated by our proposed cSKAT method. Genes and windows chosen are the top cSKAT associations presented in Table 6.

A.3. Weight Misspecification

Figure 7 displays empirical power for cSKAT, SKAT, and SKAT-O for different levels of misspecified SNV weights: complete misspecification (γmissp = 1), partial misspecification (γmissp = 0.5), and no misspecification (γmissp = 0). When SNV weights were completely misspecified, cSKAT was less powerful than SKAT and SKAT-O due to sample loss from data splitting. On the other hand, cSKAT was robust to partial misspecification, defined as half of causal SNVs being selected randomly (with random uniform effect) and half of causal SNVs being selected from SNVs with non-zero TF annotations. In all misspecification scenarios, N0 = 600 samples were sufficient to estimate cSKAT weights. Using more samples in the estimation subset (N0 = 1000) resulted in lower power because fewer samples were available for testing (N1 = N − N0).
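The trade-off between the estimation subset (N0) and the testing subset (N1 = N − N0) can be made concrete with a simple random split. The sketch below is illustrative only: the total sample size, the seed, and the function name are assumptions, and it does not reproduce the splitting used in the simulations.

```python
import random


def split_samples(n_total, n_estimation, seed=0):
    """Randomly split sample indices into estimation and testing subsets."""
    if not 0 < n_estimation < n_total:
        raise ValueError("n_estimation must be between 1 and n_total - 1")
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    # The first n_estimation shuffled indices estimate the weights;
    # the remainder are held out for association testing.
    return sorted(indices[:n_estimation]), sorted(indices[n_estimation:])


N, N0 = 5000, 600  # N is hypothetical; N0 = 600 follows the text above
estimation_idx, test_idx = split_samples(N, N0)
print(len(estimation_idx), len(test_idx))  # 600 4400
```

Annotation weights would be estimated on `estimation_idx` only and the association test computed on `test_idx` only, so enlarging N0 directly reduces the number of samples available for testing.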

Figure 7:

Power comparison of cSKAT (proposed), SKAT, and SKAT-O when SNV annotations are misspecified. Empirical power computed at α = 1 × 10−8 and averaged over 54 windows, 100 simulations per window. We set 20% of SNVs to be causal, with (γmissp × 100)% of causal SNVs selected randomly and assigned random effect sizes between 0 and 1 that are sampled from a uniform distribution U(0, 1). Causal SNVs explain 1% of trait variance (R² = 1%). The top panel shows complete misspecification where all causal SNVs have random uniform effects. The middle panel shows partial misspecification where half of causal SNVs have random uniform effects. The bottom panel shows no misspecification, i.e. a causal SNV effect size is equal to the standardized TF annotation value.

Footnotes

Data Accessibility

The Framingham Heart Study data used in this study are available from dbGaP (Study Accession: phs000007.v30.p11).

References

  1. Abderrahmani A, Niederhauser G, Plaisance V, Roehrich M-E, Lenain V, Coppola T, Regazzi R, and Waeber G (2004). Complexin i regulates glucose-induced secretion in pancreatic β-cells. Journal of cell science, 117(11):2239–2247.
  2. Barnett I, Mukherjee R, and Lin X (2017). The generalized higher criticism for testing snp-set effects in genetic association studies. Journal of the American Statistical Association, 112(517):64–76.
  3. Chen H, Meigs JB, and Dupuis J (2013). Sequence kernel association test for quantitative traits in family samples. Genetic epidemiology, 37(2):196–204.
  4. Conneely KN and Boehnke M (2007). So many correlated tests, so little time! rapid adjustment of p values for multiple correlated tests. The American Journal of Human Genetics, 81(6):1158–1168.
  5. Cortes C, Mohri M, and Rostamizadeh A (2012). Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13:795–828.
  6. Cristianini N and Shawe-Taylor J (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge university press.
  7. Davies RB (1980). The distribution of a linear combination of χ2 random variables. Applied Statistics, 29(3):323–333.
  8. Dawber TR, Meadors GF, and Moore FE Jr (1951). Epidemiological approaches to heart disease: the framingham study. American Journal of Public Health and the Nations Health, 41(3):279–286.
  9. ENCODE Project Consortium et al. (2012). An integrated encyclopedia of dna elements in the human genome. Nature, 489(7414):57.
  10. Gates LA, Shi J, Rohira AD, Feng Q, Zhu B, Bedford MT, Sagum CA, Jung SY, Qin J, Tsai M-J, et al. (2017). Acetylation on histone h3 lysine 9 mediates a switch from transcription initiation to elongation. Journal of Biological Chemistry, 292(35):14456–14472.
  11. Goldstein I and Hager GL (2015). Transcriptional and chromatin regulation during fasting–the genomic era. Trends in Endocrinology & Metabolism, 26(12):699–710.
  12. He Q, Cai T, Liu Y, Zhao N, Harmon QE, Almli LM, Binder EB, Engel SM, Ressler KJ, Conneely KN, et al. (2016). Prioritizing individual genetic variants after kernel machine testing using variable selection. Genetic epidemiology, 40(8):722–731.
  13. He Z, Xu B, Lee S, and Ionita-Laza I (2017). Unified sequence-based association tests allowing for multiple functional annotations and meta-analysis of noncoding variation in metabochip data. The American Journal of Human Genetics, 101(3):340–352.
  14. Jolliffe IT and Cadima J (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202.
  15. Kannel WB, Feinleib M, McNamara PM, Garrison RJ, and Castelli WP (1979). An investigation of coronary heart disease in families: the framingham offspring study. American journal of epidemiology, 110(3):281–290.
  16. Kuonen D (1999). Miscellanea. saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika, 86(4):929–935.
  17. Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, Christiani DC, Wurfel MM, and Lin X (2012). Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. American Journal of Human Genetics, 91(2):224–237.
  18. Li B and Leal SM (2008). Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. The American Journal of Human Genetics, 83(3):311–321.
  19. Li M, Zhou W, Yuan R, Chen L, Liu T, Huang D, Hao L, Xie Y, and Shao J (2015). Rock2 promotes hcc proliferation by cebpd inhibition through phospho-gsk3β/β-catenin signaling. FEBS letters, 589(9):1018–1025.
  20. Liu Y, Chen S, Li Z, Morrison AC, Boerwinkle E, and Lin X (2019). Acat: A fast and powerful p value combination method for rare-variant analysis in sequencing studies. The American Journal of Human Genetics, 104(3):410–421.
  21. Lumley T (2011). Complex surveys: a guide to analysis using R, volume 565 John Wiley & Sons.
  22. Madsen BE and Browning SR (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLoS genetics, 5(2):e1000384.
  23. Mahajan A, Sim X, Ng HJ, Manning A, Rivas MA, Highland HM, Locke AE, Grarup N, Im HK, Cingolani P, et al. (2015). Identification and functional characterization of g6pc2 coding variants influencing glycemic traits define an effector transcript at the g6pc2-abcb11 locus. PLoS genetics, 11(1):e1004876.
  24. Meinshausen N, Meier L, and Bühlmann P (2009). P-values for high-dimensional regression. Journal of the American Statistical Association, 104(488):1671–1681.
  25. Minica CC, Genovese G, Hultman CM, Pool R, Vink JM, Neale MC, Dolan CV, and Neale BM (2017). The Weighting is the Hardest Part: On the Behavior of the Likelihood Ratio Test and the Score Test Under a Data-Driven Weighting Scheme in Sequenced Samples. Twin Res Hum Genet, 20(2):108–118.
  26. Morgenthaler S and Thilly WG (2007). A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (cast). Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 615(1):28–56.
  27. Morrison A, Voorman A, Johnson A, Liu X, Yu J, Li A, Muzny D, Yu F, Rice K, Zhu C, Bis J, Heiss G, Donnell C, Psaty B, Cupples L, Gibbs R, and Boerwinkle E (2013). Whole-genome sequence-based analysis of high-density lipoprotein cholesterol. Nature Genetics, 45(8):7–11.
  28. Morrison AC, Huang Z, Yu B, Metcalf G, Liu X, Ballantyne C, Coresh J, Yu F, Muzny D, Feofanova E, Rustagi N, Gibbs R, and Boerwinkle E (2017). Practical Approaches for Whole-Genome Sequence Analysis of Heart- and Blood-Related Traits. The American Journal of Human Genetics, 100(2):1–11.
  29. Purcell SM, Moran JL, Fromer M, Ruderfer D, Solovieff N, Roussos P, Odushlaine C, Chambert K, Bergen SE, Kähler A, et al. (2014). A polygenic burden of rare disruptive mutations in schizophrenia. Nature, 506(7487):185.
  30. R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  31. Singh T, Walters JT, Johnstone M, Curtis D, Suvisaari J, Torniainen M, Rees E, Iyegbe C, Blackwood D, McIntosh AM, et al. (2017). The contribution of rare variants to risk of schizophrenia in individuals with and without intellectual disability. Nature genetics, 49(8):1167.
  32. Soliman H, Nyamandi V, Garcia-Patino M, Varela JN, Bankar G, Lin G, Jia Z, and MacLeod KM (2015). Partial deletion of rock2 protects mice from high-fat diet-induced cardiac insulin resistance and contractile dysfunction. American Journal of Physiology-Heart and Circulatory Physiology, 309(1):H70–H81.
  33. Splansky GL, Corey D, Yang Q, Atwood LD, Cupples LA, Benjamin EJ, D’Agostino RB Sr, Fox CS, Larson MG, Murabito JM, et al. (2007). The third generation cohort of the national heart, lung, and blood institute’s framingham heart study: design, recruitment, and initial examination. American journal of epidemiology, 165(11):1328–1335.
  34. Sun R, Hui S, Bader GD, Lin X, and Kraft P (2019). Powerful gene set analysis in gwas with the generalized berk-jones statistic. PLoS genetics, 15(3):e1007530.
  35. Tanaka T, Nishimura D, Wu R-C, Amano M, Iso T, Kedes L, Nishida H, Kaibuchi K, and Hamamori Y (2006). Nuclear rho kinase, rock2, targets p300 acetyltransferase. Journal of Biological Chemistry, 281(22):15320–15329.
  36. Teng S, Thomson PA, McCarthy S, Kramer M, Muller S, Lihm J, Morris S, Soares D, Hennah W, Harris S, et al. (2018). Rare disruptive variants in the disc1 interactome and regulome: association with cognitive ability and schizophrenia. Molecular psychiatry, 23(5):1270.
  37. Urrutia E, Lee S, Maity A, Zhao N, Shen J, Li Y, and Wu MC (2016). Rare variant testing across methods and thresholds using the multi-kernel sequence kernel association test (MK-SKAT). Statistics and its interface, 8(4):495–505.
  38. Voorman A, Brody J, Chen H, Lumley T, Davis B (2013). seqMeta: an R package for meta-analyzing region-based tests of rare DNA variants. Retrieved from https://CRAN.R-project.org/package=seqMeta
  39. Wessel J, Chu AY, Willems SM, Wang S, Yaghootkar H, Brody JA, Dauriz M, Hivert M-F, Raghavan S, Lipovich L, et al. (2015). Low-frequency and rare exome chip variants associate with fasting glucose and type 2 diabetes susceptibility. Nature communications, 6:5897.
  40. Wu MC, Lee S, Cai T, Li Y, Boehnke M, and Lin X (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. American Journal of Human Genetics, 89(1):82–93.
  41. Wu MC, Maity A, Lee S, Simmons EM, Harmon QE, Lin X, Engel SM, Molldrem JJ, and Armistead PM (2013). Kernel machine snp-set testing under multiple candidate kernels. Genetic epidemiology, 37(3):267–275.
  42. Yoshikawa T, Wu J, Otsuka M, Kishikawa T, Ohno M, Shibata C, Takata A, Han F, Kang YJ, Chen C-YA, et al. (2015). Rock inhibition enhances microrna function by promoting deadenylation of targeted mrnas via increasing paip2 expression. Nucleic acids research, 43(15):7577–7589.
  43. Zhan X, Plantinga A, Zhao N, and Wu MC (2017a). A fast small-sample kernel independence test for microbiome community-level association analysis. Biometrics, 73(4):1453–1463.
  44. Zhan X, Zhao N, Plantinga A, Thornton TA, Conneely KN, Epstein MP, and Wu MC (2017b). Powerful genetic association analysis for common or rare variants with high-dimensional structured traits. Genetics, 206(4):1779–1790.
