Convex Combination Sequence Kernel Association Test for Rare Variant Studies

Daniel Posner; Honghuang Lin; James B Meigs; Eric D Kolaczyk; Josée Dupuis

doi:10.1002/gepi.22287

Author manuscript; available in PMC: 2021 Jun 1.

Published in final edited form as: Genet Epidemiol. 2020 Feb 26;44(4):352–367. doi: 10.1002/gepi.22287

PMCID: PMC7205561 NIHMSID: NIHMS1574897 PMID: 32100372

Abstract

We propose a novel variant set test for rare-variant association studies that leverages multiple SNV annotations. Our approach optimizes a convex combination of different Sequence Kernel Association Test (SKAT) statistics, where each statistic is constructed from a different annotation and combination weights are optimized through a multiple kernel learning algorithm. The combination test statistic is evaluated empirically through data-splitting. In simulations, we find our method preserves type I error at α = 2.5 × 10⁻⁶ and has greater power than SKAT(-O) when SNV weights are not misspecified and sample sizes are large (N ≥ 5000). We apply our method in the Framingham Heart Study (FHS) to identify SNV sets associated with fasting glucose. While we are unable to detect any genome-wide significant associations between fasting glucose and 4kb windows of rare variants (p < 10⁻⁷) in 6,419 FHS participants, our method identifies suggestive associations between fasting glucose and rare variants near ROCK2 (p = 2.1 × 10⁻⁵) and within CPLX1 (p = 5.3 × 10⁻⁵). These two genes were previously reported to be involved in obesity-mediated insulin resistance and glucose-induced insulin secretion by pancreatic beta-cells, respectively. These findings will need to be replicated in other cohorts and validated by functional genomic studies.

Keywords: fasting glucose, rare variant association study, SKAT, convex optimization

1. Introduction

Many complex traits are heritable, but the exact genetic causes are difficult to determine. A common method for disentangling the effects of different genetic factors is to perform a “genome-wide association study” (GWAS), where each single nucleotide variant (SNV) is tested for association with a trait of interest. However, the minor allele frequency (MAF) of a variant can strongly influence the power of a single variant test, resulting in low power to detect the effects of rare (MAF ≤ 0.5%) and low-frequency (0.5% < MAF ≤ 5%) SNVs. For this reason, rare variants are aggregated into SNV-sets to increase the cumulative (or combined) MAF of all variants being tested, thereby improving power to detect a joint association between the trait and SNVs in the set.

A wide range of SNV-set tests have been proposed for rare variant association studies. Broadly speaking, they can be categorized as methods that combine SNVs (Li and Leal, 2008; Morgenthaler and Thilly, 2007; Madsen and Browning, 2009) or methods that combine marginal test statistics (or p-values) for each SNV (Conneely and Boehnke, 2007; Wu et al., 2011; Zhan et al., 2017b; Barnett et al., 2017; Sun et al., 2019; Liu et al., 2019). Methods that combine SNVs, or burden tests, evaluate the association between a trait and a weighted sum of SNVs (or burden score). Burden tests have poor power when SNVs have different directions of effect, leading to the adoption of methods that combine marginal test statistics and are powerful for testing a mix of protective and deleterious SNVs.

Statistical power of variant set tests can be improved by weighting SNVs by their hypothesized effect on the trait or disease risk as a function of available annotation, e.g. a function of MAF giving more weight to rarer SNVs. Fixed SNV weights, however, may misspecify the contribution of the variants and lower the power of the variant set test (Minica et al., 2017). To reduce weight misspecification, adaptive tests have been proposed that compute many different statistics and select the test with the smallest p-value, such as the Multi-kernel Sequence Kernel Association Test (MK-SKAT) (Wu et al., 2013; Urrutia et al., 2016) or the omnibus test statistic (OMNI) (Barnett et al., 2017). Other adaptive approaches optimize a combination of test statistics, such as the optimal unified SKAT (SKAT-O) (Lee et al., 2012), which finds the best convex combination of two SKAT statistics with different SNV weights. The first approach ignores complementary information from annotations that are not selected, while the second approach is restricted to two weighting schemes.

In this paper, we present a test which is a convex combination of any number of SKAT statistics. Our method optimizes composite SNV weights from multiple annotations, such that SNVs unrelated to the trait (through the annotations) are assigned low weight and are effectively excluded from the SNV-set test. Weighted-kernel averaging of SKAT statistics is not novel, and was originally proposed by Wu et al. (Wu et al., 2013). The Wu et al. approach sets kernel weights a priori, such as assigning equal weight to each candidate kernel. Our proposed method, on the other hand, adaptively estimates kernel weights from the data. We compare both approaches–fixed equal weights and adaptive weights–in simulations.

Another concern in rare variant analysis is the choice of SNV to include in the SNV-set, which is critical to the power of the association test. If the SNV-set is chosen poorly, the association signals from causal SNVs within the set are diluted by SNVs in the set that are unrelated to the trait. Genes or gene sets (e.g. pathways) are natural SNV-sets, but no such organizing principle exists for SNVs located outside of genes, called “intergenic” SNVs. We adopt the approach of a previous analysis, where intergenic SNVs were aggregated within 4000 base-pair (4kb) windows with 50% overlap (Morrison et al., 2013, 2017), and further screen SNVs using annotations with potential biological relevance to fasting metabolism (Goldstein and Hager, 2015).

2. Methods

Our method is built on SKAT (Wu et al., 2011), a variance component test for the association between a set of SNVs and a trait. We briefly describe the SKAT approach for testing SNV-sets. We then introduce our proposed method, cSKAT, to find the optimal convex combination of candidate SKAT statistics. Because of the optimization involved, the cSKAT statistic null distribution must be assessed empirically. We offer a method for doing so, based on data splitting. We also present a biologically informed approach to construct candidate kernels.

2.1. Convex-optimized SKAT (cSKAT)

Let y be an (n × 1) vector of trait values for n subjects; X be an (n × d) design matrix of non-genetic covariates and β a (d × 1) vector of non-genetic effects, both including an intercept; G be an (n × m) matrix of SNV genotypes or dosages; G_i· be the (m × 1) genotype vector for the i^th subject; and G_ij be the i^th subject’s genotype for the j^th variant (0 ≤ G_ij ≤ 2). Assume a generalized linear mixed effects model relating the trait to SNV genotypes:

g (E (y)) = X β + h

(1)

where the link function g is the identity link for continuous traits or the logit link for binary traits, h = (h(G_1·), …, h(G_n·))^T is an (n × 1) vector of genetic effects on the subjects’ traits, and the function h(·) lies in a functional space generated by a positive-semidefinite kernel function k(·, ·) that satisfies Mercer’s condition (Cristianini and Shawe-Taylor, 2000). The kernel function, k(G_i·, G_j·), measures similarity between the i^th and j^th subjects based on their SNV genotypes in the SNV-set.

When proposing SKAT, Wu et al. assumed that h is distributed N(0, τK), where τ is a variance component indexing the effect of the SNV-set and K is a known kernel matrix with entries defined by a kernel function K_ij = k(G_i·, G_j·). SKAT (Wu et al., 2011) is a test of the null hypothesis for the SNV effects, H₀ : τ = 0, using the following statistic:

Q = \frac{{(y - {\hat{y}}_{0})}^{T} K (y - {\hat{y}}_{0})}{{\hat{ϕ}}_{0}}

(2)

where ${\hat{y}}_{0} = [g^{- 1} ({\hat{β}}_{0}^{T} X_{1}), \dots, g^{- 1} ({\hat{β}}_{0}^{T} X_{n})]$ is the predicted trait from non-genetic covariates, and ${\hat{ϕ}}_{0}$ and ${\hat{β}}_{0}$ are the maximum likelihood estimates of the dispersion parameter and non-genetic effects under H₀, respectively. When y is continuous, ${\hat{ϕ}}_{0} = {\hat{σ}}_{0}^{2}$ is the residual variance of y after accounting for non-genetic covariates, and when y is binary, ${\hat{ϕ}}_{0} = 1$.
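Equation (2) can be illustrated with a minimal sketch for a continuous trait. It assumes an intercept-only null model, so that the predicted trait is the sample mean and the dispersion estimate is the maximum likelihood residual variance. The function name `skat_q` and the toy data are hypothetical and are not taken from the SKAT or seqMeta software.

```python
def skat_q(y, K):
    """Q = r^T K r / phi0 for a continuous trait under an intercept-only null.

    y: list of n trait values; K: n x n kernel matrix as a list of lists.
    Assumption: the only covariate is the intercept, so y_hat0 is the mean
    of y and phi0 is the maximum likelihood residual variance.
    """
    n = len(y)
    mean_y = sum(y) / n
    r = [yi - mean_y for yi in y]            # residuals y - y_hat0 under H0
    phi0 = sum(ri * ri for ri in r) / n      # ML estimate of sigma_0^2
    quad = sum(r[i] * K[i][j] * r[j] for i in range(n) for j in range(n))
    return quad / phi0


# Toy data (hypothetical): four subjects, one SNV with dosages 0, 1, 0, 2.
y = [1.0, 2.0, 3.0, 6.0]
g = [0, 1, 0, 2]
K = [[gi * gj for gj in g] for gi in g]      # unweighted linear kernel g g^T
q = skat_q(y, K)
```

With covariates beyond the intercept, the residuals would instead come from the fitted null regression model.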

Here we embed functional genomic elements, or annotations, directly in SKAT and weight the annotations based on their potential relevance to the trait. Given L candidate annotations, let $Q_{l}$ and $K_{l}$ be the $l^{\text{th}}$ candidate SKAT statistic and kernel matrix, and let $γ_{l}$ be the corresponding convex weight, with γ restricted to the set $\{γ : \sum_{l = 1}^{L} γ_{l} = 1, γ \geq 0\}$ . The convex SKAT (cSKAT) statistic is defined as a convex combination of candidate SKAT statistics:

Q_{γ} = \sum_{l = 1}^{L} γ_{l} Q_{l} = \frac{{(y - {\hat{y}}_{0})}^{T} (\sum_{l = 1}^{L} γ_{l} K_{l}) (y - {\hat{y}}_{0})}{{\hat{ϕ}}_{0}} .

(3)

Hence, the cSKAT statistic is defined through a convex combination of kernels, $\{K : K = \sum_{l = 1}^{L} γ_{l} K_{l}, \sum_{l = 1}^{L} γ_{l} = 1, γ \geq 0\}$ . We describe how to construct these kernels from functional genomic annotation in Section 2.3, how to estimate the convex weights in Appendix A.1, and how to evaluate the test statistic null distribution below.
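The two forms of Equation (3) agree because the quadratic form is linear in the kernel. The sketch below checks this numerically for fixed weights; the helper names and toy inputs are hypothetical, and the weight estimation of Appendix A.1 is not reproduced here.

```python
def quad_form(r, K):
    """r^T K r for a residual vector r and an n x n kernel K."""
    n = len(r)
    return sum(r[i] * K[i][j] * r[j] for i in range(n) for j in range(n))


def combine_kernels(kernels, gamma):
    """Convex combination sum_l gamma_l K_l of equally sized kernels."""
    if any(g < 0 for g in gamma) or abs(sum(gamma) - 1.0) > 1e-9:
        raise ValueError("gamma must be non-negative and sum to 1")
    n = len(kernels[0])
    return [[sum(g * K[i][j] for g, K in zip(gamma, kernels))
             for j in range(n)] for i in range(n)]


def cskat_q(r, kernels, gamma, phi0):
    """Q_gamma computed as the weighted sum of per-kernel SKAT statistics."""
    return sum(g * quad_form(r, K) / phi0 for g, K in zip(gamma, kernels))


# Toy inputs (hypothetical): residuals, two candidate kernels, fixed weights.
r = [-2.0, -1.0, 0.0, 3.0]
phi0 = 3.5
K1 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
g = [0, 1, 0, 2]
K2 = [[float(gi * gj) for gj in g] for gi in g]
gamma = [0.25, 0.75]
q_sum = cskat_q(r, [K1, K2], gamma, phi0)
q_kernel = quad_form(r, combine_kernels([K1, K2], gamma)) / phi0
```

Both `q_sum` and `q_kernel` evaluate the same statistic, once as a sum of candidate statistics and once through the combined kernel.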

When the combination weights, γ, are fixed or optimized on an independent set of data, the null distribution of $Q_{γ}$ is a weighted sum of independent χ² variables, $\sum_{j = 1}^{J} λ_{j} χ_{1}^{2}$ , where λ_j are the eigenvalues of $(1 / {\hat{ϕ}}_{0}) P^{\frac{1}{2}} (\sum_{l = 1}^{L} γ_{l} K_{l}) P^{\frac{1}{2}}$ and P = V − VX(X^TVX)⁻¹X^TV is the variance of the residuals $(y - {\hat{y}}_{0})$ . For continuous traits, $V = {\hat{σ}}_{0}^{2} I_{n}$ , where I_n is the (n × n) identity matrix. For binary traits, $V = diag [{\hat{y}}_{01} (1 - {\hat{y}}_{01}), \dots, {\hat{y}}_{0 n} (1 - {\hat{y}}_{0 n})]$ , where ${\hat{y}}_{0 i} = {logit}^{- 1} ({\hat{β}}_{0}^{T} X_{i})$ is the estimated probability that subject i is a case under H₀. Asymptotic p-values can be computed analytically with the Davies method (Davies, 1980) or approximated with high accuracy with the saddlepoint method (Kuonen, 1999).
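To make the mixture null distribution concrete, the sketch below estimates the upper-tail probability of a weighted sum of independent χ²₁ variables by simulation. This is a stand-in chosen for transparency, not the Davies or saddlepoint method named above, and it is far too coarse for genome-wide significance thresholds. The function name and the eigenvalues passed to it are hypothetical.

```python
import random


def mixture_chi2_pvalue(q_obs, eigenvalues, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(sum_j lambda_j * chi2_1 >= q_obs)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        # A chi-square variable with 1 df is a squared standard normal.
        draw = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in eigenvalues)
        if draw >= q_obs:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_draws + 1)


# Single eigenvalue of 1: the mixture reduces to a chi-square with 1 df.
p = mixture_chi2_pvalue(3.841, [1.0], n_draws=20_000)
```

The resolution of such an estimate is limited to roughly 1 / n_draws, which is why analytic or saddlepoint approximations are needed at small α.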

When the combination weights, γ, are optimized from the same data used for the test, Q can be evaluated through permutation testing. In a permutation test, the test statistic null distribution would be approximated by fully resampling the observed traits without replacement (i.e. permutation) and recomputing the test statistic for each permutation of trait values. Permutations are computationally burdensome and are difficult to implement for dependent individuals, such as relatives in the Framingham Heart Study. Due to these limitations, we instead use (single) sample-splitting in our simulations and analysis, where weights are estimated in a subset of individuals and the tests are performed in the remaining individuals. Multiple sample splits may be used to improve power and reproducibility (Meinshausen et al., 2009).
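A single sample split can be sketched as a random partition of subject indices into an estimation subset, used only for the weights, and a test subset, used only for the association test. The function name is hypothetical, and the sketch assumes independent subjects; it does not handle relatedness between participants.

```python
import random


def split_sample(n, n_est, seed=0):
    """Randomly partition subject indices 0..n-1 into two disjoint subsets.

    Returns (estimation_indices, test_indices), each sorted.
    """
    if not 0 < n_est < n:
        raise ValueError("n_est must be between 1 and n - 1")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return sorted(idx[:n_est]), sorted(idx[n_est:])


# Example: 1,000 subjects, 500 for weight estimation and 500 for testing.
est_idx, test_idx = split_sample(1000, 500)
```

Because no subject appears in both subsets, weights estimated on `est_idx` can be treated as fixed when the test is carried out on `test_idx`.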

2.2. SNV Annotations

In our analyses, we use four classes of annotation: SNV MAF and three ENCODE annotations (ENCODE Project Consortium et al., 2012), which include signals of functional genomic elements along the genome (see Table 1).

Table 1:

SNV Annotations

Class ( $l$ )	# Features	Type	[Min, Max]	Source
Open Chromatin	1	continuous	[0, 1000]	ENCODE
Transcription Factors	11	continuous	[0, 1000]	ENCODE
Histone Modifications	2	continuous	[0, 1000]	ENCODE
SKAT MAF weight	1	continuous	[0, 25]	f_Beta(1,25)(MAF)

Each ENCODE signal (scaled 0–1000) is derived from chromatin immunoprecipitation sequencing (ChIP-seq) of a specific DNA-binding element in a specific cell type. For example, the transcription factor Forkhead box protein A2 (FOXA2) has a non-zero number of reads mapping to genomic regions in red blood cells, cancer cells, and other cell types. Read counts at each genomic locus are normalized, compared against the null distribution, and transformed into false discovery rates (q-values). The signals provided by ENCODE are q-values rescaled to 0–1000 to facilitate visualization.

For each functional genomic element, such as FOXA2, we take the maximum signal over all cell types relevant to a trait. For fasting glucose, we use the maximum FOXA2 signal at each genomic location in all available red blood cells, β-cells (if available), and white blood cells. We call this FOXA2 signal vector an annotation “feature”. We call the collection of all transcription factors (TFs) an annotation “class”. Only transcription factors and histone modifications related to fasting metabolism (Goldstein and Hager, 2015) are included in our rare-variant association study of fasting glucose. We construct one kernel for each class from features in Table 2.
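The construction of a feature and of a class total can be sketched as follows. The cell-type labels and signal values are hypothetical placeholders, and the sketch assumes every track lists the same loci in the same order.

```python
def feature_signal(tracks):
    """Per-locus maximum of one element's signal across cell types.

    tracks: dict mapping a cell-type label to a list of signals (0-1000).
    """
    lengths = {len(v) for v in tracks.values()}
    if len(lengths) != 1:
        raise ValueError("all tracks must cover the same loci")
    return [max(vals) for vals in zip(*tracks.values())]


def class_sum(features):
    """Per-locus sum of the feature vectors belonging to one class."""
    return [sum(vals) for vals in zip(*features)]


# Hypothetical signals for one transcription factor at three loci.
foxa2 = feature_signal({
    "cell_type_a": [0, 500, 10],
    "cell_type_b": [20, 100, 1000],
})
tf_total = class_sum([foxa2, [5, 0, 0]])   # second feature is also made up
```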

Table 2:

SNV Annotation Functions

Class	Feature	Function (abbreviated)
Open Chromatin	DNase-seq Peaks	Indicator of regions accessible for transcription
TF	CEBP-β	Gluconeogenesis
	EGR1	Induces CEBP-α when activated by glucagon
	ERRα	Gluconeogenesis, fatty acid metabolism
	FOXA2	Gluconeogenesis, fatty-acid oxidation (FAO), ketogenesis
	GR	Induces genes encoding fasting-related transcription factors
	HNF4α	Maturity-onset Type 1 diabetes, gluconeogenesis
	NRF1	Links transcription of metabolic genes to cellular growth
	P300	Interacts with PPARγ (regulator of glucose metabolism)
	PGC-1α	Regulates energy metabolism genes
	SREBP-1,2	Lipid homeostasis
	TR	Responsible for many metabolic functions of thyroid hormone
HM	H3K9Ac	Highly correlated with active promoters
HM	H3K36me3	Represses aberrant transcription, involved in defining exons
Minor Allele Frequency	Minor Allele Frequency	Rare SNVs are more likely to be causal (due to natural selection)

2.3. Specification of Kernel Matrices

Any positive semidefinite kernel can be specified for K, though in most rare variant studies, the weighted linear kernel is used. As its name suggests, the weighted linear kernel rescales each subject’s genotype vector by fixed weights, and its entries are dot products of these weighted genotypes. Let $w_{k l}$ be the sum of features in annotation class $l$ at SNV k (normalized to the unit interval) and G_ik be the i^th subject’s dosage of the k^th SNV (0 ≤ G_ik ≤ 2). The $l^{\text{th}}$ weighted linear kernel function for subjects i and j is:

{(K_{l})}_{i j} = \sum_{k = 1}^{m} w_{k l}^{2} G_{i k} G_{j k} .

(4)
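Equation (4) translates directly into a few lines of code. The sketch below assumes the weights for one annotation class have already been normalized to the unit interval; the function name and toy genotypes are hypothetical.

```python
def weighted_linear_kernel(G, w):
    """(K)_ij = sum_k w_k^2 * G_ik * G_jk for an n x m genotype matrix G."""
    n, m = len(G), len(w)
    w2 = [wk * wk for wk in w]
    return [[sum(w2[k] * G[i][k] * G[j][k] for k in range(m))
             for j in range(n)] for i in range(n)]


# Toy genotypes (hypothetical): three subjects, two SNVs.
G = [[0, 1],
     [1, 0],
     [2, 1]]
w = [0.5, 1.0]
K = weighted_linear_kernel(G, w)
```

Subjects 1 and 2 share no minor alleles in this toy example, so their kernel entry is zero regardless of the weights.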

Optimal SNV weights for traits are unknown, so investigators use estimates based on allele frequencies (i.e. rare alleles are more likely causal due to natural selection) or predicted functional consequence scores derived from functional genomic elements, such as transcription factors. In rare variant studies, the most commonly used weight is the Beta(1,25) density evaluated at the SNV MAF, w_k = f_Beta(1,25)(MAF_k) (Wu et al., 2011). A recent study has also used functional impact scores from bioinformatics tools (Morrison et al., 2017).
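For reference, the Beta(1,25) density has the closed form f(x) = 25(1 − x)²⁴ on [0, 1], since the normalizing constant B(1, 25) equals 1/25. A minimal sketch of this MAF-based weight follows; the function name is hypothetical.

```python
def beta_1_25_weight(maf):
    """Beta(1, 25) density evaluated at a minor allele frequency."""
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must lie in [0, 1]")
    return 25.0 * (1.0 - maf) ** 24


w_rare = beta_1_25_weight(0.001)    # rare variant
w_lowfreq = beta_1_25_weight(0.05)  # upper end of the low-frequency range
```

The weight decreases monotonically from 25 at MAF = 0, so rarer variants always receive more weight.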

In our extension of SKAT, we find better SNV weights for a trait by optimizing the kernel. We consider a class of composite kernels, $\{K : K = \sum_{l = 1}^{L} γ_{l} K_{l}, \sum_{l = 1}^{L} γ_{l} = 1, γ \geq 0\}$ , from which to select an optimal kernel for the trait. The convex combination weights are optimized through centered kernel-target alignment (Cortes et al., 2012) to emphasize only annotation classes that are potentially relevant to the trait (see Appendix A.1). Before optimization, all base kernels are trace-normalized and centered by pre- and post-multiplying by an (n × n) centering matrix, $C_{n} = (I_{n} - \frac{1_{n} 1_{n}^{T}}{n})$ , where 1_n is an (n × 1) vector of 1’s:

K = \frac{C_{n} \tilde{K} C_{n}}{tr (C_{n} \tilde{K} C_{n})}

(5)

where $\tilde{K}$ is a raw kernel matrix and K is the trace-normalized and centered kernel. When all candidate kernels are weighted linear kernels, optimizing the kernel combination is equivalent to optimizing SNV weights (w = [w₁, w₂, …, w_m]) from convex combinations of annotations $\{w : w_{k} = \sum_{l = 1}^{L} γ_{l} w_{k l}, \sum_{l = 1}^{L} γ_{l} = 1, γ \geq 0\}$ . To ensure γ is interpretable, annotations of each class are normalized to the unit interval.
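The centering and trace normalization of Equation (5) can be sketched without forming C_n explicitly, because pre- and post-multiplying by C_n subtracts row and column means and adds back the grand mean. The second function computes the centered alignment between a single kernel and the target kernel y yᵀ, following the standard definition in Cortes et al. (2012); it scores one kernel only and is not the weight optimization of Appendix A.1. All function names and toy data are hypothetical.

```python
from math import sqrt


def center_and_trace_normalize(K_raw):
    """Return C_n K C_n / tr(C_n K C_n) for a symmetric n x n kernel."""
    n = len(K_raw)
    row_mean = [sum(row) / n for row in K_raw]
    col_mean = [sum(K_raw[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    Kc = [[K_raw[i][j] - row_mean[i] - col_mean[j] + grand
           for j in range(n)] for i in range(n)]
    trace = sum(Kc[i][i] for i in range(n))
    if trace <= 0:
        raise ValueError("centered kernel has zero trace")
    return [[v / trace for v in row] for row in Kc]


def centered_alignment(K_raw, y):
    """Centered alignment between K_raw and the target kernel y y^T."""
    n = len(y)
    mean_y = sum(y) / n
    r = [v - mean_y for v in y]                 # centered trait
    Kc = center_and_trace_normalize(K_raw)
    num = sum(r[i] * Kc[i][j] * r[j] for i in range(n) for j in range(n))
    norm_K = sqrt(sum(v * v for row in Kc for v in row))
    norm_Y = sum(v * v for v in r)              # Frobenius norm of r r^T
    return num / (norm_K * norm_Y)


# Toy check (hypothetical): a trait equal to the genotype aligns perfectly.
g = [0.0, 1.0, 0.0, 2.0]
K_raw = [[gi * gj for gj in g] for gi in g]
K = center_and_trace_normalize(K_raw)
a = centered_alignment(K_raw, g)
```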

2.4. Type I Error and Power

We perform simulations to evaluate Type I error and compare power of our proposed test (cSKAT) and four versions of SKAT: unweighted linear combination SKAT (i.e. a sum of SKAT statistics computed separately with one annotation) (Wu et al., 2013), SKAT with ideal weights equal to SNP effect sizes, and SKAT (Wu et al., 2011) and SKAT-O (Lee et al., 2012) with weights as a function of MAF only. We also evaluate the power of a Cauchy combination test or ACAT (Liu et al., 2019) that is a combination of p-values from SKAT tests for each annotation, separately. MK-SKAT (Wu et al., 2013; Urrutia et al., 2016) software has yet to be released and, to our knowledge, is not computationally feasible for these simulations. For cSKAT, we create candidate kernels from MAF and three annotation classes from ENCODE: open chromatin (OC), transcription factors (TF), and histone modification (HM). Annotations used in each test are presented in Table 3.

Table 3:

Tests Compared

Test	Annotation used
ACAT	OC, TF, HM, MAF
cSKAT (proposed)	OC, TF, HM, MAF
cSKAT, restricted to a subset of annotations	OC, MAF
SKAT, unweighted linear combination	OC, TF, HM, MAF
SKAT (ideal weights)	data-generating annotation
SKAT	MAF
SKAT-O	MAF

We simulate whole genomes for subjects with the software HAPGEN2 using reference genomes of European ancestry from the 1000 Genomes Project. We adopt the SNV test aggregation of intergenic regions from a previous analysis, where intergenic SNVs were grouped within 4000 base-pair (4kb) windows (Morrison et al., 2013, 2017). The tests are performed for each window with observed cumulative minor allele count (MAC) greater than 20 and evaluated at multiple type I error levels (α).
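The window construction and the minor allele count filter can be sketched as below. The 50% overlap is taken from the description of the windows in the Introduction; the treatment of the final, shorter windows at the end of a region and the function names are assumptions made for illustration, and genotypes are assumed to be coded as minor allele counts.

```python
def sliding_windows(start, end, width=4000, overlap=0.5):
    """Half-open windows [s, e) whose starts advance by width * (1 - overlap).

    Windows that would run past `end` are clipped (an assumption).
    """
    step = int(width * (1 - overlap))
    if step <= 0:
        raise ValueError("overlap must be below 1")
    return [(s, min(s + width, end)) for s in range(start, end, step)]


def cumulative_mac(G_window):
    """Total minor allele count over all subjects and SNVs in one window."""
    return sum(sum(row) for row in G_window)


def is_testable(G_window, min_mac=20):
    """True when the window's cumulative MAC is greater than min_mac."""
    return cumulative_mac(G_window) > min_mac


windows = sliding_windows(0, 10_000)
```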

To assess type I error (α), we run 1,000 simulations with 1,000 subjects whose trait is generated from a standard normal distribution $y_{i} \overset{iid}{\sim} N (0, 1)$ . In each simulation, we test 20,000 windows, using 500 subjects for optimizing the cSKAT weights (N₀ = 500) and the other 500 subjects for testing at level α (N₁ = 500). Because the weights are optimized on a subset of individuals who are independent from individuals used for hypothesis testing, p-values are computed from the SKAT null distribution $Q \overset{H_{0}}{\sim} \sum_{j = 1}^{J} λ_{j} χ_{1}^{2}$ , where λ_j are eigenvalues of $(1 / {\hat{ϕ}}_{0}) P^{\frac{1}{2}} (\sum_{l = 1}^{L} γ_{l} K_{l}) P^{\frac{1}{2}}$ , instead of the permutation distribution.

To compare statistical power of the test statistics, we run 100 simulations using 10,000 subjects in 54 windows (of length 4kb) for different trait-generating models. The windows selected for power simulations satisfy several criteria:

Over half of the SNVs have 2 or fewer non-zero annotations
All annotations are present and vary across the window
Number of SNVs ≥ 5
At least one SNV has a unique annotation (i.e. ≥ 1 SNV with OC-only, ≥ 1 SNV with TF-only, or ≥ 1 SNV with HM-only)

The first condition implies some degree of orthogonality between annotation classes. In windows with highly correlated annotations, estimated weights are unstable and difficult to interpret. The other criteria ensure a diverse set of weights and causal SNVs are included in simulations. All annotation classes must be present in a window to simulate equal class weights. When 20% of SNVs are causal, at least 5 SNVs are needed for one causal SNV. The unique annotation condition ensures less abundant annotations are well-represented in the simulations and do not always coincide with more abundant annotations.

We evaluate the power of the cSKAT statistic given various sample sizes for estimation and testing. Power for SKAT and SKAT-O is evaluated on the full sample in each simulation. Let ${\tilde{w}}_{k l}$ be the sum of all annotations of class $l$ for SNV k normalized to the unit interval. For each window and simulation scenario γ, we select 20% of SNVs as causal based on annotation, $P (\text{SNV } k \text{ is causal}) = \sum_{l = 1}^{4} γ_{l} {\tilde{w}}_{k l} / \sum_{k = 1}^{m} \sum_{l = 1}^{4} γ_{l} {\tilde{w}}_{k l}$ . We simulate a continuous trait for each simulation with a simple linear model:

y = \sum_{k = 1}^{\tilde{m}} β_{k} g_{k} + ε

(6)

where $\tilde{m}$ is the number of causal variants in the window, β_k is the effect of the k^th causal SNV specified as $β_{k} = \sum_{l = 1}^{4} γ_{l} {\tilde{w}}_{k l}$ , and random error $ε \sim N (0, σ_{e}^{2})$ where $σ_{e}^{2}$ is fixed so that SNVs explain 1% of the trait variance, $R_{window}^{2} = 1 \%$ . Note that cSKAT weights $\hat{γ}$ are estimated on the kernel-level and differ from the simulation model γ, which are on the scale of the original data.
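The trait-generating model of Equation (6) can be sketched for a single window as follows. Two details are assumptions, since the text gives only the selection probability and the target R²: causal SNVs are drawn one at a time without replacement with probability proportional to their composite annotation score, and σ_e² is set from the sample variance of the genetic component. The function name and toy inputs are hypothetical.

```python
import random


def simulate_trait(G, w_tilde, gamma, frac_causal=0.2, r2=0.01, seed=0):
    """Simulate y = sum_k beta_k g_k + eps for one window.

    G: n x m genotype matrix; w_tilde: m x L normalized annotations;
    gamma: L class weights of the data-generating model.
    """
    rng = random.Random(seed)
    n, m = len(G), len(G[0])
    # Composite annotation score per SNV; also used as the effect size.
    score = [sum(g * w for g, w in zip(gamma, w_tilde[k])) for k in range(m)]
    candidates = [k for k in range(m) if score[k] > 0]
    n_causal = min(len(candidates), max(1, round(frac_causal * m)))
    causal = []
    for _ in range(n_causal):
        # Draw one SNV with probability proportional to its score.
        weights = [score[k] for k in candidates]
        pick = rng.choices(candidates, weights=weights)[0]
        causal.append(pick)
        candidates.remove(pick)
    genetic = [sum(score[k] * G[i][k] for k in causal) for i in range(n)]
    mean_g = sum(genetic) / n
    var_g = sum((v - mean_g) ** 2 for v in genetic) / n
    # Error variance chosen so that var_g / (var_g + sigma2_e) = r2.
    sigma2_e = var_g * (1 - r2) / r2
    y = [v + rng.gauss(0.0, sigma2_e ** 0.5) for v in genetic]
    return y, sorted(causal), sigma2_e


# Toy window (hypothetical): 6 subjects, 5 SNVs, classes OC, TF, HM, MAF.
G = [[0, 0, 0, 0, 0], [1, 0, 1, 0, 0], [0, 0, 0, 1, 0],
     [0, 1, 2, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 0]]
w_tilde = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0.8, 0, 0],
           [0, 0, 0, 1], [1, 0, 0, 0]]
y, causal, sigma2_e = simulate_trait(G, w_tilde, gamma=[0, 1, 0, 0])
```

With all weight on the TF class, only the third SNV carries a non-zero score in this toy window, so it is the single causal variant.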

We also evaluate the robustness of our approach to partial and complete misspecification of SNV weights. Let γ_missp be the degree of misspecification ranging from 0 to 1. We select $γ_{missp} \times \tilde{m}$ causal variants randomly (without regard for annotations) and assign them random uniform effect sizes β_missp ~ U(0, 1). The remaining variants are selected and weighted for annotation exactly the same as in Equation (6). In these simulations, we define partial misspecification as γ_missp = 0.5 and complete misspecification as γ_missp = 1.

SNV annotations w_k are fixed (see Table 2), while annotation class weights γ are varied according to Table 4. In the first scenario, all annotation classes have equal weight: 0.25 to open chromatin (OC), 0.25 to transcription factors (TF), 0.25 to histone modification (HM), and 0.25 to a function of MAF. We assign equal weight to OC and TF in the second scenario (γ_OC = γ_TF = 0.5), assign all weight to TF in the third scenario (γ_TF = 1), and assign all weight to a function of MAF in the last scenario (γ_MAF = 1).

Table 4:

Power Simulation Parameters

$R_{window}^{2}$	γ_OC	γ_TF	γ_HM	γ_MAF	γ_missp
1%	0.25	0.25	0.25	0.25	0
1%	0.5	0.5	0	0	0
1%	0	1	0	0	0
1%	0	0	0	1	0
1%	0	0	0	0	1
1%	0	0.5	0	0	0.5
1%	0	1	0	0	0

To determine the sample size used for estimating weights in power simulations, we compare estimated weights (averaged over 54 windows and 100 simulations per window) across multiple sample sizes (N₀ = 100, 200, …, 2000). Let ${\hat{γ}}_{l, n}$ be the average estimated weight for annotation $l$ in sample size n and denote the (absolute) difference between weights estimated at consecutive sample sizes $δ_{n} = \sum_{l = 1}^{4} | {\hat{γ}}_{l, n} - {\hat{γ}}_{l, n - 200} |$ . We compare cSKAT power with two different estimation sample sizes, N₀ based on criteria δ_n ≤ 0.1 and δ_n ≤ 0.05.
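The stopping rule based on δ_n can be sketched as a scan over consecutive sample sizes. The sketch compares each sample size with the previous one in the supplied grid, which matches the n − 200 lag in the definition above only when the grid has a spacing of 200; the averaged weights shown are made-up values, and the function name is hypothetical.

```python
def first_stable_size(avg_weights, threshold):
    """Smallest sample size whose delta_n is at or below the threshold.

    avg_weights: dict mapping sample size n to the list of averaged
    estimated class weights at that n. Returns (n, delta_n) or None.
    """
    sizes = sorted(avg_weights)
    for prev, cur in zip(sizes, sizes[1:]):
        delta = sum(abs(a - b)
                    for a, b in zip(avg_weights[cur], avg_weights[prev]))
        if delta <= threshold:
            return cur, delta
    return None


# Hypothetical averaged weights (OC, TF, HM, MAF) on a grid of spacing 200.
avg_weights = {
    200: [0.40, 0.30, 0.20, 0.10],
    400: [0.30, 0.40, 0.20, 0.10],
    600: [0.26, 0.44, 0.20, 0.10],
    800: [0.25, 0.45, 0.20, 0.10],
}
loose = first_stable_size(avg_weights, 0.1)
strict = first_stable_size(avg_weights, 0.05)
```

A stricter threshold selects a larger estimation subset, which leaves fewer subjects for the association test.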

2.5. Analysis in the Framingham Heart Study

We applied our method to data from the Framingham Heart Study (FHS), an ongoing longitudinal cohort study with detailed medical history, physical examinations, and medical tests (Dawber et al., 1951). The first 5209 FHS participants, called the “Original Cohort”, were recruited in 1948. In 1971, a second cohort (“Offspring”) of 5124 participants was recruited from offspring of the Original Cohort and their spouses (Kannel et al., 1979). Finally, the Third Generation Cohort (“Gen III”) consists of 4095 grandchildren of the Original Cohort and children of Offspring Cohort spouses whose parents were not in the Original Cohort (Splansky et al., 2007). While originally developed as a cardiovascular cohort study, the FHS includes many other traits, such as fasting glucose and various cancers. In our analysis, we tested associations between ≥ 8-hour fasting glucose and SNVs in genes or intergenic (4kb) windows.

We used genetic and trait data for 6419 diabetes-free participants from the Offspring Cohort at exam 5 and Third Generation Cohort at exam 1. Fasting glucose residuals were computed within each sex and cohort by regressing fasting glucose on age and age squared.

We constructed weighted linear kernels from each annotation class in Table 2. All features within each class were summed and resulting SNV weights were normalized to the unit interval. When applying our method, we estimated convex weights in an unrelated subset of individuals (n=1814) and used the remaining individuals (n=4605) to test the association between fasting glucose and SNVs within genes and intergenic windows. A modified SKAT statistic, famSKAT (Chen et al., 2013), was used in the association test to account for relatedness between FHS participants. We also performed SKAT-O in the full set of individuals (n=6419). All analyses were run in R version 3.4.3 (R Core Team, 2019) with the seqMeta package (Voorman et al., 2013), which implements the famSKAT method to account for relatedness between participants.

3. Results

3.1. Simulation Results

Using data-splitting, type I error (α) of cSKAT was controlled at all levels but 0.05 and was slightly conservative at type I error levels below 0.005 (see Table 5). Inflation at α = 0.05 may be due to small sample size. The null distribution should be evaluated through permutation testing when possible to correct for this departure from the nominal significance level.

Table 5:

Type I Error for cSKAT

α	Observed Type I Error	95% CI
0.05	0.05163	(0.05153, 0.05173)
0.005	0.00501	(0.00498, 0.00504)
0.001	0.00097	(0.00096, 0.00098)
5 × 10⁻⁴	4.9 × 10⁻⁴	(4.8 × 10⁻⁴, 5.0 × 10⁻⁴)
1 × 10⁻⁴	9.2 × 10⁻⁵	(8.8 × 10⁻⁵, 9.7 × 10⁻⁵)
1 × 10⁻⁵	9.6 × 10⁻⁶	(8.3 × 10⁻⁶, 1.1 × 10⁻⁵)
2.5 × 10⁻⁶	2.4 × 10⁻⁶	(1.7 × 10⁻⁶, 3.1 × 10⁻⁶)
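The width of these intervals can be checked with a normal-approximation (Wald) interval for a proportion. The source does not state how the intervals were computed, so both the Wald form and the total of 1,000 × 20,000 tests (every window tested in every simulation) are assumptions; under them, the sketch reproduces the interval reported for α = 0.05.

```python
from math import sqrt


def wald_ci(p_hat, n_tests, z=1.96):
    """Normal-approximation 95% interval for an observed proportion."""
    half = z * sqrt(p_hat * (1.0 - p_hat) / n_tests)
    return p_hat - half, p_hat + half


# Assumed number of tests: 1,000 simulations x 20,000 windows.
n_tests = 1_000 * 20_000
lower, upper = wald_ci(0.05163, n_tests)
```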

Figure 1 is a plot of estimated cSKAT weights at different sample sizes for each simulation scenario. Note that cSKAT weights $\hat{γ}$ are estimated from the variance component model used for SKAT which differs from the simulation model, and consequently $\hat{γ}$ do not converge to γ. In all simulation scenarios, estimated cSKAT weights $\hat{γ}$ converged within 1000 samples for criteria δ_n ≤ 0.05 and within 600 samples for criteria δ_n ≤ 0.1.

Figure 2 displays empirical power for cSKAT, SKAT, and SKAT-O computed at α = 10⁻⁸, averaged over the 54 windows and 100 simulations per window. Power was evaluated for sample sizes N=200 to 2000 (by 100), N=2000 to 5000 (by 500), and N=5000 to 10000 (by 1000) with estimation subset N₀ withheld from the cSKAT test. For most sample sizes (N ≤ 8000), cSKAT had greater power for the smaller estimation subset (N₀ = 600) than the larger estimation subset (N₀ = 1000), indicating a preference for test sample size (N₁) over optimality of weights $\hat{γ}$ . Under all simulated scenarios, cSKAT with N₀ = 600 had greater power than SKAT and SKAT-O in moderately large samples (N ≥ 5000). For smaller samples (N ≤ 4000), cSKAT was less powerful than SKAT and SKAT-O due to sample loss from data splitting. Power for cSKAT improved when annotation weights were more concentrated, with up to 15% higher power than SKAT-O when transcription factors had a weight of 1.

Figure 3 is a comparison of statistical power for different methods (see Table 3). Power was computed at α = 10⁻⁸, averaged over the 54 windows and 100 simulations per window. Power was evaluated for sample sizes N=3000, 5000, and 7000. Our proposed cSKAT approach was more powerful than the unweighted combination SKAT and had comparable power to ACAT in large samples (N > 7000). In moderately large samples (N > 5000), cSKAT was more powerful than unweighted combination SKAT when the only causal annotation was MAF-based. For smaller samples (N < 5000), cSKAT was less powerful than unweighted combination SKAT and ACAT potentially due to reduced sample size from sample-splitting.

The ACAT approach, which is a combination of p-values from SKAT tests for each annotation, was more powerful than cSKAT and SKAT tests in most scenarios, almost reaching the power of SKAT with ideal weights (an upper bound on statistical power). In the scenario where the only causal annotation was MAF-based, however, the standard SKAT and SKAT-O with only MAF-based annotation had greater power than ACAT and cSKAT.

3.2. Results in FHS

Our cSKAT test had a low genomic inflation factor (λ_GC = 1.037) comparable to the SKAT-O test (λ_GC = 1.039). The Q-Q plots (see Figure 4) indicate the estimation and test subsets were sufficiently independent for cSKAT.
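The genomic inflation factor is conventionally the median of the 1-df χ² statistics implied by the p-values, divided by the median of a χ²₁ distribution (about 0.455). The sketch below uses that conventional definition; the source does not state how λ_GC was computed, so this is an assumption, and the function name is hypothetical. Input p-values must lie in (0, 1].

```python
from statistics import NormalDist, median


def genomic_inflation(pvalues):
    """lambda_GC from a list of p-values in (0, 1]."""
    nd = NormalDist()
    # A two-sided p-value p corresponds to chi2_1 = z^2 with z = Phi^-1(p / 2).
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in pvalues]
    expected_median = nd.inv_cdf(0.25) ** 2     # median of chi2_1, ~0.4549
    return median(chi2) / expected_median


# Evenly spaced p-values mimic a well-calibrated null.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
lam = genomic_inflation(uniform_p)
```

Values close to 1, such as those reported above, indicate little systematic inflation of the test statistics.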

Due to small sample size in FHS (test subset n=4605), we found no genome-wide significant associations (p < 10⁻⁷) between fasting glucose and the tested regions (see Figure 5). However, two of the top cSKAT associations had potential biological connections to fasting glucose and were undetected by SKAT-O (see Table 6). The strongest association was in chromosome 2 for a region within 20kb of ROCK2 (cSKAT p = 2.11 × 10⁻⁵, SKAT-O p = 0.10), which has been shown to induce obesity-mediated insulin resistance and cardiac dysfunction (Soliman et al., 2015). In this region near ROCK2, the estimated annotation weights were 1 for transcription factors and 0 for all other annotation classes, suggesting the region may have a regulatory effect on ROCK2. The second strongest association was found in the gene CPLX1 (cSKAT p = 5.26 × 10⁻⁵, SKAT-O p = 0.39), which has previously been implicated in glucose-induced secretion of insulin by pancreatic beta-cells (Abderrahmani et al., 2004). The estimated annotation weights in CPLX1 were large for histone modification (0.296) and minor allele frequency (0.642). The other top associations had no known biological connection to fasting glucose.

Figure 5: Manhattan plots of the rare-variant association study in FHS using our proposed cSKAT approach. The solid line is the genome-wide significance level (α = 1.47 × 10⁻⁷) and the dotted line is a suggestive threshold (α = 10⁻⁴).

Table 6:

Top cSKAT Associations

Chr	Mid-bp	Nearest Gene	Distance from Gene	cSKAT p-value	SKAT-O p-value	n_SNVs	γ_OC ^†	γ_TF	γ_HM	γ_MAF
2	11506210	ROCK2	22 kb	2.1 × 10⁻⁵	0.10	7	0	1	0	0
4	804928	CPLX1	0	5.3 × 10⁻⁵	0.39	11	0.021	0.041	0.296	0.642
1	222069979	LOC101929771	56 kb	6.1 × 10⁻⁵	0.74	16	0	0.316	0.440	0.244
3	50476518	CACNA2D2	0	6.5 × 10⁻⁵	0.03	312	0	0	1	0
4	54857928	RPL21P44	5 kb	9.7 × 10⁻⁵	0.34	10	0.035	0.458	0	0.506
8	41910691	KAT6A	1 kb	9.9 × 10⁻⁵	0.08	6	0	1	0	0

^† γ_OC = estimated weight for the open chromatin annotation (0 ≤ γ ≤ 1)

Table 7 lists highly annotated SNVs (with cSKAT weight > 40%) in the top two associated windows. Annotations in Table 7 are presented in their original scale (bounded between 0 and 1000). The cSKAT-estimated SNV weights ( $w_{cSKAT} = \sum_{l = 1}^{4} {\hat{γ}}_{l} w_{l}$ ) have been rescaled so all SNV weights sum to 1, and represent the proportion of trait variance explained by the window that is attributable to the SNV. In the window near ROCK2, our method attributed 96% of SNV weight to two SNVs enriched for transcription factors and open chromatin. In CPLX1, 41% of trait variance explained by the gene was attributed to one SNV located at a histone modification. A comparison of cSKAT and SKAT-O SNV weights is provided in Appendix A.2.

Table 7:

SNVs with cSKAT Weight > 40%

Chr	bp	Gene	Major/Minor	MAF	w_cSKAT ^†	OC ^‡	TF: CEBP-β	TF: FOXA2	TF: HNF4α	TF: P300	HM: H3K9Ac
2	11506578	ROCK2	T/C	0.0017	0.48	360	968	752	959	1000	128
2	11506743	ROCK2	C/A	0.0022	0.48	360	968	752	959	1000	128
4	818583	CPLX1	T/A	0.0017	0.41	229	0	0	0	0	922

^† w_cSKAT = estimated weight for SNV (0 ≤ w ≤ 1). Proportion of total SNV weight in the gene attributable to the SNV. The two SNVs in ROCK2 have identical weights because they have the same TF annotation value, and TF had 100% weight in ROCK2 (i.e. w_cSKAT = TF value).

^‡ Annotations range from 0 (no signal) to 1000 (max signal).
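The rescaled cSKAT SNV weights described above can be illustrated with a minimal sketch. The per-annotation SNV weights below are hypothetical values chosen for illustration, not FHS data; only the kernel weights γ = (0, 1, 0, 0) are taken from the ROCK2 row of Table 6. The function name is our own.

```python
def cskat_snv_weights(gamma, annotation_weights):
    """Combine per-annotation SNV weights with kernel weights gamma,
    then rescale so the combined SNV weights sum to 1."""
    combined = [
        sum(g * w for g, w in zip(gamma, snv_row))
        for snv_row in annotation_weights
    ]
    total = sum(combined)
    if total == 0:
        raise ValueError("all combined SNV weights are zero")
    return [c / total for c in combined]


# Kernel weights in the order (OC, TF, HM, MAF); all weight on TF,
# as estimated for the window near ROCK2 in Table 6.
gamma = [0.0, 1.0, 0.0, 0.0]

# Hypothetical per-annotation weights for three SNVs (rows) and
# four annotations (columns). These numbers are illustrative only.
annotation_weights = [
    [0.4, 0.9, 0.1, 0.5],
    [0.4, 0.9, 0.1, 0.5],
    [0.0, 0.2, 0.7, 0.5],
]

weights = cskat_snv_weights(gamma, annotation_weights)
print(weights)
```

Because all kernel weight falls on TF in this example, the first two SNVs, which share a TF value, receive identical rescaled weights, mirroring the pattern noted in the Table 7 footnote.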

We also compared SKAT and cSKAT weights in G6PC2, a gene with known rare variant associations with fasting glucose (Wessel et al., 2015; Mahajan et al., 2015). Four likely candidates were found to be driving the joint association between fasting glucose and rare variants in G6PC2: rs138726309, rs2232323, rs146779637, and rs2232326. In our FHS analysis, we found all four variants had greater cSKAT weight than the standard SKAT weight derived from MAF and a Beta(1,25) distribution. In particular, the cSKAT weight for rs2232326 (MAF = 0.0016) was three-fold higher than the SKAT weight (w_cSKAT = 0.065 vs w_SKAT = 0.02).
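As a rough illustration of the standard MAF-based weighting mentioned above, the sketch below evaluates the Beta(1, 25) density at a few minor allele frequencies. As we understand the convention of Wu et al. (2011), this density serves as the square root of the SKAT variant weight. The MAF of 0.0016 is that of rs2232326; the other two frequencies are hypothetical. The rescaling that yields w_SKAT = 0.02 is not described in this section, so the sketch does not attempt to reproduce that value.

```python
import math


def beta_density(x, a=1.0, b=25.0):
    """Beta(a, b) probability density at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x)
    )


# 0.0016 is the MAF of rs2232326; 0.01 and 0.05 are hypothetical.
mafs = [0.0016, 0.01, 0.05]
densities = [beta_density(m) for m in mafs]

for maf, d in zip(mafs, densities):
    print(f"MAF = {maf:.4f}  Beta(1, 25) density = {d:.3f}")
```

The density decreases monotonically in MAF, so rarer variants are up-weighted regardless of their functional annotation.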

The computational burden of the optimization step of cSKAT is small relative to the burden of computing SKAT statistics and p-values. In our real data application, weight optimization was completed in 34.4 CPU hours, while association testing required 337.1 CPU hours for 340,136 genes and 4kb sliding windows including a total of 5,442,193 rare SNPs (MAF < 0.05).

4. Discussion

In this paper, we present a novel method, cSKAT, for optimizing the rare variant SKAT statistic over multiple potentially relevant SNV annotations. The method has higher power than SKAT and SKAT-O in large cohorts (N ≥ 5000) when SNV weights are not completely misspecified, and provides interpretable SNV weights that can inform biological functional studies.

In FHS, we find a possible association between fasting glucose and rare variants near ROCK2 (p = 2.1 × 10⁻⁵) and within CPLX1 (p = 5.3 × 10⁻⁵), genes involved in obesity-mediated insulin resistance (Soliman et al., 2015) and glucose-induced insulin secretion by pancreatic beta-cells (Abderrahmani et al., 2004), respectively. In the window near ROCK2, our method assigns 96% of SNV weight to two SNVs at an active transcription factor (TF) binding site. At these highly weighted loci, the strongest TF signal is P300, which interacts directly with ROCK2. ROCK2 regulates the acetyltransferase activity of P300 through phosphorylation (Tanaka et al., 2006). The second largest TF signals, CEBP-β and HNF4-α, are only indirectly related to ROCK2, e.g. ROCK2 knockdown has been shown to increase gene expression of CEBPD (Li et al., 2015), which forms heterodimers with CEBP-α. ROCK1, which often shares functions with ROCK2, interacts with HNF4-α (Yoshikawa et al., 2015). There is no known link between ROCK2 and the other active transcription factor at this site, FOXA2.

In CPLX1, 41% of trait variance explained by the gene was attributed to one SNV with high values of histone modification H3K9Ac. H3K9Ac serves an important role in transcription and its loss or depletion in promoters can reduce gene expression. In a recent study, investigators hypothesized that H3K9Ac recruits proteins downstream of transcription initiation which are needed for the next step of transcription (Gates et al., 2017). Replication is required to validate these results in other cohorts (using the estimated annotation weights from FHS).

Our method has several limitations. Data-splitting allows us to compute p-values efficiently from the SKAT null distribution but reduces power because samples are excluded from testing. Sample splitting may also result in splits where variants present in one split are unobserved in another split due to low frequency. To address this limitation, we have optimized kernels based on annotations rather than optimizing single-variant weights. If an annotation has a strong biological connection to the trait, we would expect stronger association between annotated SNVs (genome-wide) and the trait. For example, Purcell et al. (2014) show genome-wide sets of annotated variants (e.g. indel and frameshift variants) are enriched for associations with schizophrenia. Meta-analysis can mitigate loss of power due to sample loss, and multiple sample splitting can ensure all SNVs are involved in kernel optimization (Meinshausen et al., 2009). Further simulations will be required to evaluate other kernel optimization schemes for meta-analysis and find the number of sample splits that balances the increase in power against the increase in computation time.

We also optimize a standardized test statistic rather than a p-value. The distribution of SKAT statistics under different kernels is complex and rescaling may not be adequate in some cases. We restrict our application to linear kernels with this limitation in mind. For more complex kernels, we recommend minimizing p-values rather than maximizing a standardized statistic.

Another limitation of our method is a priori selection of annotation. All rare variant tests require a priori specification of annotation, but our method is especially sensitive to choice of annotation. When annotations are too correlated, the optimization problem is not strictly convex and its solution may not be unique (Cortes et al., 2012). Annotations may also be too sparse or completely absent from many variant sets. In both cases, the optimized kernel weights γ would be difficult to interpret. Here, we sidestep the issue by aggregating annotations within biological classes and using only annotations related to fasting metabolism. For more extensive annotations that are highly correlated, we suggest creating orthogonal annotation classes with principal component analysis (Jolliffe and Cadima, 2016). In regions where annotations are too sparse for cSKAT, we instead recommend the standard SKAT approach followed by post-hoc variable selection to prioritize individual rare variants, such as Kernel Iterative Feature Extraction (KNIFE) (He et al., 2016).

We included only four annotation sources in this paper based on their well-documented involvement in regulating fasting metabolism (Goldstein and Hager, 2015), but cSKAT can easily accommodate additional annotations. For example, several schizophrenia studies have shown rare disruptive variants (nonsense, essential splice site or frameshift) substantially increase risk for schizophrenia (Purcell et al., 2014; Singh et al., 2017; Teng et al., 2018). In these studies, separate analyses were conducted for disruptive variants and other rare variant sets. Using cSKAT, the separate analyses could be combined by pooling all SNVs and coding set memberships as binary annotations (0 for exclusion, 1 for inclusion). The optimized cSKAT weights would then enable direct comparisons between disruptive variants and other variant classes. Given rapidly growing and publicly available functional genomic annotations, adaptive annotation weighting is now an invaluable tool for pinpointing the biological mechanisms driving associations between rare variants and complex traits.
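The binary coding of set membership described above could be sketched as follows. The SNV identifiers and set names are hypothetical, and the function name is our own; each resulting 0/1 vector would play the role of one per-annotation SNV weight vector, and hence one candidate kernel, in cSKAT.

```python
def binary_annotations(snv_ids, variant_sets):
    """Return one 0/1 annotation vector per named variant set,
    with entries ordered like snv_ids (1 = member, 0 = not a member)."""
    return {
        name: [1 if snv in members else 0 for snv in snv_ids]
        for name, members in variant_sets.items()
    }


# Hypothetical pooled SNVs and set memberships, for illustration only.
snv_ids = ["snv1", "snv2", "snv3", "snv4"]
variant_sets = {
    "disruptive": {"snv1", "snv3"},
    "other_rare": {"snv2", "snv4"},
}

annotations = binary_annotations(snv_ids, variant_sets)
print(annotations)
```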

5. Acknowledgments

This work was partially supported by NIH grant U01 DK078616. The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI. Funding for SHARe Affymetrix genotyping was provided by NHLBI Contract N02-HL64278. SHARe Illumina genotyping was provided under an agreement between Illumina and Boston University.

A. Appendix

A.1. cSKAT Optimization

Existing rare variant tests with adaptive annotation selection use the minimum p-value over annotations. Two such tests have been developed for SKAT (Urrutia et al., 2016) and SKAT-O (He et al., 2017), where the test statistic is the minimum p-value among SKAT(-O) statistics computed for each annotation. Denote $p_{Q_{l}}$ as the p-value for the SKAT(-O) statistic using the $l^{th}$ annotation, $l = 1, \dots, L$. The minP statistic is:

T = \min \{ p_{Q_{1}}, p_{Q_{2}}, \dots, p_{Q_{L}} \} .

(7)

The significance of the T statistic can be evaluated analytically. The minimum p-value approach performs well but scales poorly to combinations of annotation, where p-values must be computed for a grid of combination weights λ and numerical integration is required to compute the p-value of T. On the other hand, maximizing combinations of test statistics without accounting for their p-values results in poor power (Zhan et al., 2017a,b). In our application, adding SNVs to a SNV-set would increase the test statistic but also increase the eigenvalues of the kernel matrix. Intuitively, rescaling kernel test statistics by their kernel matrix eigenvalues could help connect test statistic maximization to p-value minimization. For example, the null distribution of a SKAT statistic rescaled by its eigenvalues can be approximated through the Satterthwaite approach (Lumley, 2011):

\frac{Q}{{‖ λ_{γ} ‖}_{2}} \overset{approx}{\sim} a χ_{v}^{2}

(8)

where scale parameter $a = \frac{{‖ λ_{γ} ‖}_{2}}{{‖ λ_{γ} ‖}_{1}}$ and degrees of freedom $v = {(\frac{{‖ λ_{γ} ‖}_{1}}{{‖ λ_{γ} ‖}_{2}})}^{2}$ are ratios of the l₁ and l₂ norms of the kernel matrix eigenvalues λ_γ. Thus, increasing eigenvalues will increase the scale but decrease the degrees of freedom. While we optimize a standardized test statistic, Q/‖λ_γ‖₂, the distribution of SKAT statistics under different kernels is complex and rescaling may not be adequate in some cases. We restrict our application to linear kernels with this limitation in mind. For more complex kernels, we recommend minimizing p-values rather than maximizing a standardized statistic.
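The scale and degrees of freedom above can be checked by moment matching. Under the null, Q is distributed as a mixture of independent χ²₁ variables weighted by the eigenvalues, so Q/‖λ_γ‖₂ has mean ‖λ_γ‖₁/‖λ_γ‖₂ and variance 2. The sketch below computes a and v for a hypothetical set of eigenvalues and confirms that a χ²_v scaled by a has the same two moments; the eigenvalues and function name are illustrative assumptions.

```python
import math


def satterthwaite_params(eigenvalues):
    """Scale a and degrees of freedom v for the approximation
    Q / ||lambda||_2 ~ a * chi-square_v, from kernel eigenvalues."""
    l1 = sum(abs(x) for x in eigenvalues)
    l2 = math.sqrt(sum(x * x for x in eigenvalues))
    a = l2 / l1
    v = (l1 / l2) ** 2
    return a, v


# Hypothetical non-negative kernel eigenvalues, for illustration only.
eigs = [4.0, 2.0, 1.0, 0.5]
a, v = satterthwaite_params(eigs)

# Moments of a * chi-square_v: mean = a * v, variance = 2 * a^2 * v.
approx_mean = a * v
approx_var = 2.0 * a * a * v

# Moments of Q / ||lambda||_2 under the chi-square mixture null.
l1 = sum(eigs)
l2 = math.sqrt(sum(x * x for x in eigs))
exact_mean = l1 / l2
exact_var = 2.0

print(a, v, approx_mean, exact_mean, approx_var, exact_var)
```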

To incorporate multiple sources of annotation, we optimize a convex combination of SKAT statistics, $Q_{γ} = \sum_{l = 1}^{L} γ_{l} Q_{l}$ , where the eigenvalues λ of test statistic Q_γ depend on the convex weights γ. We show that maximizing Q/‖λ‖₂ is equivalent to maximizing the centered alignment $A$ between trait y and convex combination kernel K_γ for a continuous trait with no non-genetic covariates. For a continuous trait, the cSKAT statistic can be rewritten as $Q_{γ} = {\hat{σ}}_{0}^{- 2} y^{T} P K_{γ} P y$ where projection matrix P = I_n − X(X^T X)⁻¹X^T. When there are no non-genetic covariates, P is simply a centering matrix ( $I_{n} - \frac{1_{n} 1_{n}^{T}}{n}$ ) and, assuming all candidate kernels are centered, the cSKAT statistic reduces to $Q_{γ} = {\hat{σ}}_{0}^{- 2} y^{T} K_{γ} y$ . Let 〈. , .〉_F and ∥.∥_F denote the Frobenius inner product and norm, and let the centered kernel-target alignment be $A (y y^{T}, K_{γ}) = \frac{{〈 y y^{T}, \sum_{l = 1}^{L} γ_{l} K_{l} 〉}_{F}}{{‖ y y^{T} ‖}_{F} {‖ K_{γ} ‖}_{F}}$ . Then observe:

\frac{Q_{γ}}{‖ λ ‖_{2}} \propto \frac{y^{T} K_{γ} y}{\sqrt{\sum_{j = 1}^{J} λ_{j}^{2}}} = \frac{tr (y y^{T} \sum_{l = 1}^{L} γ_{l} K_{l})}{\sqrt{tr (K_{γ}^{2})}} \propto \frac{{〈 y y^{T}, \sum_{l = 1}^{L} γ_{l} K_{l} 〉}_{F}}{{‖ y y^{T} ‖}_{F} {‖ K_{γ} ‖}_{F}} = A (y y^{T}, K_{γ}) .

(9)

Hence, maximizing the cSKAT statistic scaled by its eigenvalues is equivalent to maximizing the centered kernel target alignment between trait and convex combination kernel:

\arg\max_{γ} \frac{Q_{γ}}{‖ λ ‖_{2}} = \arg\max_{γ} A (y y^{T}, K_{γ}) .

(10)
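The proportionality in equation (9) can be checked numerically. The sketch below simulates a hypothetical genotype matrix, trait, and three weighted linear kernels, then confirms that the ratio of the eigenvalue-scaled statistic to the alignment is the same constant (y^T y) for every choice of γ. All data, dimensions, and function names are illustrative assumptions, and the constant factor σ̂₀⁻² is omitted because it does not depend on γ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 8  # hypothetical sample size and number of SNVs


def centered_linear_kernel(G, w):
    """Weighted linear kernel G diag(w) G^T, centered on both sides."""
    n_samples = G.shape[0]
    H = np.eye(n_samples) - np.ones((n_samples, n_samples)) / n_samples
    return H @ (G * w) @ G.T @ H


# Simulated rare-variant genotypes and a centered trait (no covariates,
# so the projection matrix P reduces to the centering matrix).
G = rng.binomial(2, 0.05, size=(n, m)).astype(float)
y = rng.normal(size=n)
y = y - y.mean()

# Three candidate kernels built from hypothetical SNV weights.
kernels = [centered_linear_kernel(G, rng.uniform(size=m)) for _ in range(3)]


def combined_kernel(gamma):
    return sum(g * K for g, K in zip(gamma, kernels))


def scaled_statistic(gamma):
    """y^T K_gamma y divided by the l2 norm of the eigenvalues of K_gamma."""
    K = combined_kernel(gamma)
    lam = np.linalg.eigvalsh(K)
    return (y @ K @ y) / np.linalg.norm(lam)


def alignment(gamma):
    """Centered kernel-target alignment A(y y^T, K_gamma)."""
    K = combined_kernel(gamma)
    yy = np.outer(y, y)
    return np.sum(yy * K) / (np.linalg.norm(yy) * np.linalg.norm(K))


for gamma in ([1.0, 0.0, 0.0], [0.2, 0.3, 0.5]):
    print(gamma, scaled_statistic(gamma) / alignment(gamma), y @ y)
```

Since the two objectives differ only by a factor that is constant in γ, they share the same maximizer, which is the content of equation (10).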

When there are non-genetic covariates, the optimal convex weights for a trait maximize the kernel-target alignment between the residuals of trait regressed on non-genetic covariates ( $e = y - {\hat{y}}_{0}$ ) and the centered convex combination kernel ( $K_{γ} = \sum_{l = 1}^{L} γ_{l} K_{l}$ ):

γ = \arg\max_{γ} \frac{{〈 e e^{T}, K_{γ} 〉}_{F}}{{‖ K_{γ} ‖}_{F}} .

(11)

Let a be the vector of inner products between residuals and centered candidate kernels, $a = {({〈 e e^{T}, K_{1} 〉}_{F}, \dots, {〈 e e^{T}, K_{L} 〉}_{F})}^{T}$ , and let M denote the matrix of inner products between candidate kernels, i.e. M_jk = 〈K_j, K_k〉_F. Then the optimal convex weights, γ = v/∥v∥, are the solution to the following Quadratic Programming (QP) problem:

\min_{v \geq 0} v^{T} M v - 2 v^{T} a .

(12)
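A minimal sketch of how the QP in equation (12) might be solved is given below. It uses projected gradient descent, which is our choice for illustration and not necessarily the solver used in the analyses reported here; any QP or non-negative least squares routine could be substituted. The normalization γ = v/∥v∥ is taken to be the l₁ norm so that the weights sum to one, which is an assumption on our part. Function names are hypothetical.

```python
import numpy as np


def solve_alignment_qp(M, a, n_iter=5000):
    """Minimize v^T M v - 2 v^T a subject to v >= 0
    by projected gradient descent (illustrative solver)."""
    M = np.asarray(M, dtype=float)
    a = np.asarray(a, dtype=float)
    # The gradient 2 (M v - a) is Lipschitz with constant 2 * lambda_max(M).
    step = 1.0 / (2.0 * np.linalg.eigvalsh(M).max())
    v = np.zeros_like(a)
    for _ in range(n_iter):
        grad = 2.0 * (M @ v - a)
        v = np.maximum(v - step * grad, 0.0)  # project onto v >= 0
    return v


def convex_weights(kernels, e):
    """Estimate convex kernel weights from residuals e and a list of
    (assumed centered) candidate kernel matrices."""
    ee = np.outer(e, e)
    # a_l = <e e^T, K_l>_F and M_jk = <K_j, K_k>_F.
    a = np.array([np.sum(ee * K) for K in kernels])
    M = np.array([[np.sum(Kj * Kk) for Kk in kernels] for Kj in kernels])
    v = solve_alignment_qp(M, a)
    total = v.sum()
    if total == 0:
        raise ValueError("QP solution is zero; weights are undefined")
    return v / total  # assumption: l1 normalization so weights sum to 1


# Small check with a diagonal M, where the solution is known in closed form:
# the unconstrained minimizer is (1, -1), so the constrained one is (1, 0).
v_demo = solve_alignment_qp([[2.0, 0.0], [0.0, 1.0]], [2.0, -1.0])
print(v_demo)
```

When candidate kernels are highly correlated, M is close to singular and the solution may not be unique, in which case an iterative solver of this kind can converge slowly.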

A.2. Comparison of SNV weights used in FHS

In Figure 6, we compare cSKAT and SKAT weights for SNVs in the windows included in Table 6. Both cSKAT and SKAT weights were rescaled to [0, 1] to facilitate comparisons. SKAT weights were generally uniform, with differences between cSKAT and SKAT weights driven by extreme cSKAT weights. SNVs with large cSKAT weights generally had large annotation values and a strong association with the trait. In two of the six windows, for example, cSKAT assigned the majority of weight in the window to a few SNVs at a transcription factor binding site and histone modification, respectively (Table 7).

A.3. Weight Misspecification

Figure 7 displays empirical power for cSKAT, SKAT, and SKAT-O for different levels of misspecified SNV weights: complete misspecification (γ_missp = 1), partial misspecification (γ_missp = 0.5), and no misspecification (γ_missp = 0). When SNV weights were completely misspecified, cSKAT was less powerful than SKAT and SKAT-O due to sample loss from data splitting. On the other hand, cSKAT was robust to partial misspecification, defined as half of causal SNVs being selected randomly (with random uniform effect) and half of causal SNVs being selected from SNVs with non-zero TF annotations. In all misspecification scenarios, N₀ = 600 samples were sufficient to estimate cSKAT weights. Using more samples in the estimation subset (N₀ = 1000) resulted in lower power because fewer samples were available for testing (N₁ = N − N₀).
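The data-splitting step referred to above can be sketched as a random partition of sample indices into an estimation subset of size N₀ and a testing subset of size N₁ = N − N₀. The function name and seed are hypothetical. The pairing of N = 6,419 (the FHS sample size) with N₀ = 600 (the simulation setting) is for illustration only and is not a statement about the split used in the FHS analysis.

```python
import random


def split_samples(n_total, n_estimation, seed=0):
    """Randomly partition sample indices into an estimation subset
    (size N0) and a testing subset (size N1 = N - N0)."""
    if not 0 < n_estimation < n_total:
        raise ValueError("n_estimation must be between 1 and n_total - 1")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    estimation = sorted(indices[:n_estimation])
    testing = sorted(indices[n_estimation:])
    return estimation, testing


# Illustrative sizes: N = 6419 and N0 = 600, so N1 = 5819.
est_idx, test_idx = split_samples(6419, 600)
print(len(est_idx), len(test_idx))
```

Kernel weights would be estimated on the first subset only and the association test computed on the second, so that the SKAT null distribution remains valid for the test statistic.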

Footnotes

⁶

Data Accessibility

The Framingham Heart Study data used in this study are available from dbGaP (Study Accession: phs000007.v30.p11).

References

Abderrahmani A, Niederhauser G, Plaisance V, Roehrich M-E, Lenain V, Coppola T, Regazzi R, and Waeber G (2004). Complexin i regulates glucose-induced secretion in pancreatic β-cells. Journal of cell science, 117(11):2239–2247.
Barnett I, Mukherjee R, and Lin X (2017). The generalized higher criticism for testing snp-set effects in genetic association studies. Journal of the American Statistical Association, 112(517):64–76.
Chen H, Meigs JB, and Dupuis J (2013). Sequence kernel association test for quantitative traits in family samples. Genetic epidemiology, 37(2):196–204.
Conneely KN and Boehnke M (2007). So many correlated tests, so little time! rapid adjustment of p values for multiple correlated tests. The American Journal of Human Genetics, 81(6):1158–1168.
Cortes C, Mohri M, and Rostamizadeh A (2012). Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13:795–828.
Cristianini N and Shawe-Taylor J (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge university press.
Davies RB (1980). The distribution of a linear combination of χ² random variables. Applied Statistics, 29(3):323–333.
Dawber TR, Meadors GF, and Moore FE Jr (1951). Epidemiological approaches to heart disease: the framingham study. American Journal of Public Health and the Nations Health, 41(3):279–286.
ENCODE Project Consortium et al. (2012). An integrated encyclopedia of dna elements in the human genome. Nature, 489(7414):57.
Gates LA, Shi J, Rohira AD, Feng Q, Zhu B, Bedford MT, Sagum CA, Jung SY, Qin J, Tsai M-J, et al. (2017). Acetylation on histone h3 lysine 9 mediates a switch from transcription initiation to elongation. Journal of Biological Chemistry, 292(35):14456–14472.
Goldstein I and Hager GL (2015). Transcriptional and chromatin regulation during fasting–the genomic era. Trends in Endocrinology & Metabolism, 26(12):699–710.
He Q, Cai T, Liu Y, Zhao N, Harmon QE, Almli LM, Binder EB, Engel SM, Ressler KJ, Conneely KN, et al. (2016). Prioritizing individual genetic variants after kernel machine testing using variable selection. Genetic epidemiology, 40(8):722–731.
He Z, Xu B, Lee S, and Ionita-Laza I (2017). Unified sequence-based association tests allowing for multiple functional annotations and meta-analysis of noncoding variation in metabochip data. The American Journal of Human Genetics, 101(3):340–352.
Jolliffe IT and Cadima J (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202.
Kannel WB, Feinleib M, McNamara PM, Garrison RJ, and Castelli WP (1979). An investigation of coronary heart disease in families: the framingham offspring study. American journal of epidemiology, 110(3):281–290.
Kuonen D (1999). Miscellanea. saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika, 86(4):929–935.
Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, Christiani DC, Wurfel MM, and Lin X (2012). Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. American Journal of Human Genetics, 91(2):224–237.
Li B and Leal SM (2008). Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. The American Journal of Human Genetics, 83(3):311–321.
Li M, Zhou W, Yuan R, Chen L, Liu T, Huang D, Hao L, Xie Y, and Shao J (2015). Rock2 promotes hcc proliferation by cebpd inhibition through phospho-gsk3β/β-catenin signaling. FEBS letters, 589(9):1018–1025.
Liu Y, Chen S, Li Z, Morrison AC, Boerwinkle E, and Lin X (2019). Acat: A fast and powerful p value combination method for rare-variant analysis in sequencing studies. The American Journal of Human Genetics, 104(3):410–421.
Lumley T (2011). Complex surveys: a guide to analysis using R, volume 565. John Wiley & Sons.
Madsen BE and Browning SR (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLoS genetics, 5(2):e1000384.
Mahajan A, Sim X, Ng HJ, Manning A, Rivas MA, Highland HM, Locke AE, Grarup N, Im HK, Cingolani P, et al. (2015). Identification and functional characterization of g6pc2 coding variants influencing glycemic traits define an effector transcript at the g6pc2-abcb11 locus. PLoS genetics, 11(1):e1004876.
Meinshausen N, Meier L, and Bühlmann P (2009). P-values for high-dimensional regression. Journal of the American Statistical Association, 104(488):1671–1681.
Minica CC, Genovese G, Hultman CM, Pool R, Vink JM, Neale MC, Dolan CV, and Neale BM (2017). The Weighting is the Hardest Part: On the Behavior of the Likelihood Ratio Test and the Score Test Under a Data-Driven Weighting Scheme in Sequenced Samples. Twin Res Hum Genet, 20(2):108–118.
Morgenthaler S and Thilly WG (2007). A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (cast). Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 615(1):28–56.
Morrison A, Voorman A, Johnson A, Liu X, Yu J, Li A, Muzny D, Yu F, Rice K, Zhu C, Bis J, Heiss G, Donnell C, Psaty B, Cupples L, Gibbs R, and Boerwinkle E (2013). Whole-genome sequence-based analysis of high-density lipoprotein cholesterol. Nature Genetics, 45(8):7–11.
Morrison AC, Huang Z, Yu B, Metcalf G, Liu X, Ballantyne C, Coresh J, Yu F, Muzny D, Feofanova E, Rustagi N, Gibbs R, and Boerwinkle E (2017). Practical Approaches for Whole-Genome Sequence Analysis of Heart- and Blood-Related Traits. The American Journal of Human Genetics, 100(2):1–11.
Purcell SM, Moran JL, Fromer M, Ruderfer D, Solovieff N, Roussos P, Odushlaine C, Chambert K, Bergen SE, Kähler A, et al. (2014). A polygenic burden of rare disruptive mutations in schizophrenia. Nature, 506(7487):185.
R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Singh T, Walters JT, Johnstone M, Curtis D, Suvisaari J, Torniainen M, Rees E, Iyegbe C, Blackwood D, McIntosh AM, et al. (2017). The contribution of rare variants to risk of schizophrenia in individuals with and without intellectual disability. Nature genetics, 49(8):1167.
Soliman H, Nyamandi V, Garcia-Patino M, Varela JN, Bankar G, Lin G, Jia Z, and MacLeod KM (2015). Partial deletion of rock2 protects mice from high-fat diet-induced cardiac insulin resistance and contractile dysfunction. American Journal of Physiology-Heart and Circulatory Physiology, 309(1):H70–H81.
Splansky GL, Corey D, Yang Q, Atwood LD, Cupples LA, Benjamin EJ, D’Agostino RB Sr, Fox CS, Larson MG, Murabito JM, et al. (2007). The third generation cohort of the national heart, lung, and blood institute’s framingham heart study: design, recruitment, and initial examination. American journal of epidemiology, 165(11):1328–1335.
Sun R, Hui S, Bader GD, Lin X, and Kraft P (2019). Powerful gene set analysis in gwas with the generalized berk-jones statistic. PLoS genetics, 15(3):e1007530.
Tanaka T, Nishimura D, Wu R-C, Amano M, Iso T, Kedes L, Nishida H, Kaibuchi K, and Hamamori Y (2006). Nuclear rho kinase, rock2, targets p300 acetyltransferase. Journal of Biological Chemistry, 281(22):15320–15329.
Teng S, Thomson PA, McCarthy S, Kramer M, Muller S, Lihm J, Morris S, Soares D, Hennah W, Harris S, et al. (2018). Rare disruptive variants in the disc1 interactome and regulome: association with cognitive ability and schizophrenia. Molecular psychiatry, 23(5):1270.
Urrutia E, Lee S, Maity A, Zhao N, Shen J, Li Y, and Wu MC (2016). Rare variant testing across methods and thresholds using the multi-kernel sequence kernel association test (MK-SKAT). Statistics and its interface, 8(4):495–505.
Voorman A, Brody J, Chen H, Lumley T, Davis B (2013). seqMeta: an R package for meta-analyzing region-based tests of rare DNA variants. Retrieved from https://CRAN.R-project.org/package=seqMeta
Wessel J, Chu AY, Willems SM, Wang S, Yaghootkar H, Brody JA, Dauriz M, Hivert M-F, Raghavan S, Lipovich L, et al. (2015). Low-frequency and rare exome chip variants associate with fasting glucose and type 2 diabetes susceptibility. Nature communications, 6:5897.
Wu MC, Lee S, Cai T, Li Y, Boehnke M, and Lin X (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. American Journal of Human Genetics, 89(1):82–93.
Wu MC, Maity A, Lee S, Simmons EM, Harmon QE, Lin X, Engel SM, Molldrem JJ, and Armistead PM (2013). Kernel machine snp-set testing under multiple candidate kernels. Genetic epidemiology, 37(3):267–275.
Yoshikawa T, Wu J, Otsuka M, Kishikawa T, Ohno M, Shibata C, Takata A, Han F, Kang YJ, Chen C-YA, et al. (2015). Rock inhibition enhances microrna function by promoting deadenylation of targeted mrnas via increasing paip2 expression. Nucleic acids research, 43(15):7577–7589.
Zhan X, Plantinga A, Zhao N, and Wu MC (2017a). A fast small-sample kernel independence test for microbiome community-level association analysis. Biometrics, 73(4):1453–1463.
Zhan X, Zhao N, Plantinga A, Thornton TA, Conneely KN, Epstein MP, and Wu MC (2017b). Powerful genetic association analysis for common or rare variants with high-dimensional structured traits. Genetics, 206(4):1779–1790.

[R1] Abderrahmani A, Niederhauser G, Plaisance V, Roehrich M-E, Lenain V, Coppola T, Regazzi R, and Waeber G (2004). Complexin i regulates glucose-induced secretion in pancreatic β-cells. Journal of cell science, 117(11):2239–2247. [DOI] [PubMed] [Google Scholar]

[R2] Barnett I, Mukherjee R, and Lin X (2017). The generalized higher criticism for testing snp-set effects in genetic association studies. Journal of the American Statistical Association, 112(517):64–76. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Chen H, Meigs JB, and Dupuis J (2013). Sequence kernel association test for quantitative traits in family samples. Genetic epidemiology, 37(2):196–204. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] Conneely KN and Boehnke M (2007). So many correlated tests, so little time! rapid adjustment of p values for multiple correlated tests. The American Journal of Human Genetics, 81(6):1158–1168. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] Cortes C, Mohri M, and Rostamizadeh A (2012). Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13:795–828. [Google Scholar]

[R6] Cristianini N and Shawe-Taylor J (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge university press. [Google Scholar]

[R7] Davies RB (1980). The distribution of a linear combination of x2 random variables. Applied Statistics, 29(3):323–333. [Google Scholar]

[R8] Dawber TR, Meadors GF, and Moore FE Jr (1951). Epidemiological approaches to heart disease: the framingham study. American Journal of Public Health and the Nations Health, 41(3):279–286. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] ENCODE Project Consortium et al. (2012). An integrated encyclopedia of dna elements in the human genome. Nature, 489(7414):57. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] Gates LA, Shi J, Rohira AD, Feng Q, Zhu B, Bedford MT, Sagum CA, Jung SY, Qin J, Tsai M-J, et al. (2017). Acetylation on histone h3 lysine 9 mediates a switch from transcription initiation to elongation. Journal of Biological Chemistry, 292(35):14456–14472. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] Goldstein I and Hager GL (2015). Transcriptional and chromatin regulation during fasting–the genomic era. Trends in Endocrinology & Metabolism, 26(12):699–710. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] He Q, Cai T, Liu Y, Zhao N, Harmon QE, Almli LM, Binder EB, Engel SM, Ressler KJ, Conneely KN, et al. (2016). Prioritizing individual genetic variants after kernel machine testing using variable selection. Genetic epidemiology, 40(8):722–731. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] He Z, Xu B, Lee S, and Ionita-Laza I (2017). Unified sequence-based association tests allowing for multiple functional annotations and meta-analysis of noncoding variation in metabochip data. The American Journal of Human Genetics, 101(3):340–352. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] Jolliffe IT and Cadima J (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150202. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] Kannel WB, Feinleib M, McNamara PM, Garrison RJ, and Castelli WP (1979). An investigation of coronary heart disease in families: the framingham offspring study. American journal of epidemiology, 110(3):281–290. [DOI] [PubMed] [Google Scholar]

[R16] Kuonen D (1999). Miscellanea. saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika, 86(4):929–935. [Google Scholar]

[R17] Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, Christiani DC, Wurfel MM, and Lin X (2012). Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. American Journal of Human Genetics, 91(2):224–237. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] Li B and Leal SM (2008). Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. The American Journal of Human Genetics, 83(3):311–321. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] Li M, Zhou W, Yuan R, Chen L, Liu T, Huang D, Hao L, Xie Y, and Shao J (2015). Rock2 promotes hcc proliferation by cebpd inhibition through phospho-gsk3β/β-catenin signaling. FEBS letters, 589(9):1018–1025. [DOI] [PubMed] [Google Scholar]

[R20] Liu Y, Chen S, Li Z, Morrison AC, Boerwinkle E, and Lin X (2019). Acat: A fast and powerful p value combination method for rare-variant analysis in sequencing studies. The American Journal of Human Genetics, 104(3):410–421. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] Lumley T (2011). Complex surveys: a guide to analysis using R, volume 565 John Wiley & Sons. [Google Scholar]

[R22] Madsen BE and Browning SR (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLoS genetics, 5(2):e1000384. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] Mahajan A, Sim X, Ng HJ, Manning A, Rivas MA, Highland HM, Locke AE, Grarup N, Im HK, Cingolani P, et al. (2015). Identification and functional characterization of g6pc2 coding variants influencing glycemic traits define an effector transcript at the g6pc2-abcb11 locus. PLoS genetics, 11(1):e1004876. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] Meinshausen N, Meier L, and Bühlmann P (2009). P-values for high-dimensional regression. Journal of the American Statistical Association, 104(488):1671–1681. [Google Scholar]

[R25] Minica CC, Genovese G, Hultman CM, Pool R, Vink JM, Neale MC, Dolan CV, and Neale BM (2017). The Weighting is the Hardest Part: On the Behavior of the Likelihood Ratio Test and the Score Test Under a Data-Driven Weighting Scheme in Sequenced Samples. Twin Res Hum Genet, 20(2):108–118. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] Morgenthaler S and Thilly WG (2007). A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (cast). Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 615(1):28–56. [DOI] [PubMed] [Google Scholar]

[R27] Morrison A, Voorman A, Johnson A, Liu X, Yu J, Li A, Muzny D, Yu F, Rice K, Zhu C, Bis J, Heiss G, Donnell C, Psaty B, Cupples L, Gibbs R, and Boerwinkle E (2013). Whole-genome sequence-based analysis of high-density lipoprotein cholesterol. Nature Genetics, 45(8):7–11. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] Morrison AC, Huang Z, Yu B, Metcalf G, Liu X, Ballantyne C, Coresh J, Yu F, Muzny D, Feofanova E, Rustagi N, Gibbs R, and Boerwinkle E (2017). Practical Approaches for Whole-Genome Sequence Analysis of Heart- and Blood-Related Traits. The American Journal of Human Genetics, 100(2):1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] Purcell SM, Moran JL, Fromer M, Ruderfer D, Solovieff N, Roussos P, Odushlaine C, Chambert K, Bergen SE, Kähler A, et al. (2014). A polygenic burden of rare disruptive mutations in schizophrenia. Nature, 506(7487):185. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

[R31] Singh T, Walters JT, Johnstone M, Curtis D, Suvisaari J, Torniainen M, Rees E, Iyegbe C, Blackwood D, McIntosh AM, et al. (2017). The contribution of rare variants to risk of schizophrenia in individuals with and without intellectual disability. Nature Genetics, 49(8):1167.

[R32] Soliman H, Nyamandi V, Garcia-Patino M, Varela JN, Bankar G, Lin G, Jia Z, and MacLeod KM (2015). Partial deletion of ROCK2 protects mice from high-fat diet-induced cardiac insulin resistance and contractile dysfunction. American Journal of Physiology-Heart and Circulatory Physiology, 309(1):H70–H81.

[R33] Splansky GL, Corey D, Yang Q, Atwood LD, Cupples LA, Benjamin EJ, D’Agostino RB Sr, Fox CS, Larson MG, Murabito JM, et al. (2007). The third generation cohort of the National Heart, Lung, and Blood Institute’s Framingham Heart Study: design, recruitment, and initial examination. American Journal of Epidemiology, 165(11):1328–1335.

[R34] Sun R, Hui S, Bader GD, Lin X, and Kraft P (2019). Powerful gene set analysis in GWAS with the generalized Berk-Jones statistic. PLoS Genetics, 15(3):e1007530.

[R35] Tanaka T, Nishimura D, Wu R-C, Amano M, Iso T, Kedes L, Nishida H, Kaibuchi K, and Hamamori Y (2006). Nuclear Rho kinase, ROCK2, targets p300 acetyltransferase. Journal of Biological Chemistry, 281(22):15320–15329.

[R36] Teng S, Thomson PA, McCarthy S, Kramer M, Muller S, Lihm J, Morris S, Soares D, Hennah W, Harris S, et al. (2018). Rare disruptive variants in the DISC1 interactome and regulome: association with cognitive ability and schizophrenia. Molecular Psychiatry, 23(5):1270.

[R37] Urrutia E, Lee S, Maity A, Zhao N, Shen J, Li Y, and Wu MC (2016). Rare variant testing across methods and thresholds using the multi-kernel sequence kernel association test (MK-SKAT). Statistics and Its Interface, 8(4):495–505.

[R38] Voorman A, Brody J, Chen H, Lumley T, and Davis B (2013). seqMeta: an R package for meta-analyzing region-based tests of rare DNA variants. Retrieved from https://CRAN.R-project.org/package=seqMeta

[R39] Wessel J, Chu AY, Willems SM, Wang S, Yaghootkar H, Brody JA, Dauriz M, Hivert M-F, Raghavan S, Lipovich L, et al. (2015). Low-frequency and rare exome chip variants associate with fasting glucose and type 2 diabetes susceptibility. Nature Communications, 6:5897.

[R40] Wu MC, Lee S, Cai T, Li Y, Boehnke M, and Lin X (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. American Journal of Human Genetics, 89(1):82–93.

[R41] Wu MC, Maity A, Lee S, Simmons EM, Harmon QE, Lin X, Engel SM, Molldrem JJ, and Armistead PM (2013). Kernel machine SNP-set testing under multiple candidate kernels. Genetic Epidemiology, 37(3):267–275.

[R42] Yoshikawa T, Wu J, Otsuka M, Kishikawa T, Ohno M, Shibata C, Takata A, Han F, Kang YJ, Chen C-YA, et al. (2015). ROCK inhibition enhances microRNA function by promoting deadenylation of targeted mRNAs via increasing PAIP2 expression. Nucleic Acids Research, 43(15):7577–7589.

[R43] Zhan X, Plantinga A, Zhao N, and Wu MC (2017a). A fast small-sample kernel independence test for microbiome community-level association analysis. Biometrics, 73(4):1453–1463.

[R44] Zhan X, Zhao N, Plantinga A, Thornton TA, Conneely KN, Epstein MP, and Wu MC (2017b). Powerful genetic association analysis for common or rare variants with high-dimensional structured traits. Genetics, 206(4):1779–1790.
