Abstract
Genome-wide Association Studies (GWAS) methods have identified individual single-nucleotide polymorphisms (SNPs) significantly associated with specific phenotypes. Nonetheless, many complex diseases are polygenic and are controlled by multiple genetic variants that are usually non-linearly dependent. These genetic variants are marginally less effective and remain undetected in GWAS analysis. Kernel-based tests (KBT), which evaluate the joint effect of a group of genetic variants, are therefore critical for complex disease analysis. However, choosing different kernel functions in KBT can significantly influence the type I error control and power, and selecting the optimal kernel remains a statistically challenging task. A few existing methods suffer from inflated type 1 errors, limited scalability, inferior power or issues of ambiguous conclusions. Here, we present a new Bayesian framework, BayesKAT (https://github.com/wangjr03/BayesKAT), which overcomes these kernel specification issues by selecting the optimal composite kernel adaptively from the data while testing genetic associations simultaneously. Furthermore, BayesKAT implements a scalable computational strategy to boost its applicability, especially for high-dimensional cases where other methods become less effective. Based on a series of performance comparisons using both simulated and real large-scale genetics data, BayesKAT outperforms the available methods in detecting complex group-level associations and controlling type I errors simultaneously. Applied on a variety of groups of functionally related genetic variants based on biological pathways, co-expression gene modules and protein complexes, BayesKAT deciphers the complex genetic basis and provides mechanistic insights into human diseases.
Keywords: Kernel based test, Genetic association test, Bayesian model, Optimal kernel selection, Complex diseases
INTRODUCTION
Deciphering the genetic basis of complex traits, such as the Alzheimer’s disease, plays pivotal roles in functional genomics and precision medicine [1, 2]. Based on the advancement in high-throughput sequencing techniques, specific associated genetic variants, e.g. single-nucleotide polymorphisms (SNPs), have been identified for a large panel of phenotypes using Genome-wide Association Studies (GWAS) [3]. However, traditional GWAS approaches treat SNPs independently and can only discover individual SNPs that have strong marginal statistical associations with the phenotype of interest. It is well documented that many complex diseases and phenotypes are often associated with multiple genetic variants [4–7] where an individual variant itself might be weakly associated with the phenotype. In contrast, groups of such SNPs may jointly contribute to the phenotype, potentially mediated via their cooperative participation in important biological processes or pathways [8, 9]. Therefore, the traditional GWAS framework of testing individual SNPs separately without considering the correlation structures and the potential interactions among SNPs may not capture the group-wise joint SNP effects. Separate testings of SNP associations by traditional GWAS approaches are also limited to reveal the underlying biological mechanisms of complex phenotypes. Alternative approaches based on multivariate regression significantly suffer from the large degrees of freedom in genome-wide association tests and can substantially lose the statistical power [10].
To overcome this critical challenge, kernel-based testing (KBT) framework has been introduced to test group-wise joint SNP effects [10–16]. By incorporating a kernel function to measure the similarity among genetic variants and compare with the phenotype similarities, the KBT framework simultaneously models the joint effects of multiple genetic variants. Wu et al. [12] first proposed the widely used sequence kernel association test (SKAT) model to test rare-variant associations. As a supervised, versatile and computationally streamlined regression approach, SKAT accesses the associations between genetic variants within a specific region and the trait. As the outputs from the SKAT model, P-values of the statistical associations are generated, facilitating straightforward interpretations of the findings. An R package [17] has been developed for implementing different kinds of KBT models, including SKAT.
To enable novel discoveries of the genetic basis underlying complex diseases, maximizing statistical power in genome-wide association tests while effectively controlling type 1 errors is strongly desired. Under the KBT framework, statistical power heavily depends on the specific choice of kernel functions [16, 18–20]. However, the existing KBT models, including SKAT, require the kernel function to be specified a priori. Because the true functional relationship between the genetic variants and phenotypes is usually unknown in practice, selecting the ideal kernel function in advance for the KBT model, one that maximizes statistical power without increasing the type 1 error rate, poses statistical and computational challenges. One common approach that has been used is to repeat the KBT procedures based on different choices of kernels and then select the one resulting in the minimum P-value, which has been discussed by multiple studies [18, 21]. The major problem of this straightforward approach is the inflated type 1 error. Although data-dependent permutation or perturbation methods [18] can help ease the problem, they are not computationally scalable, especially when applied to high-dimensional datasets in large-scale genomic studies. An alternative approach is to use an equal-weighted average of multiple candidate kernels to form an averaged composite kernel [18], which performs better than the worst-performing candidate kernel but does not usually achieve the performance of the optimal kernel function. Tests based on the average kernel approach may lead to inconsistent or incorrect conclusions in applications, as we will demonstrate below. He et al. [21] proposed a maximum kernel test model based on the U statistic, i.e. the mKU model, which claims to achieve the statistical power as close as to the best candidate kernel in high-dimensional settings under certain distributional assumptions. However, the specific distributional assumptions may not hold in practice and hence lead to inflated P-values, which will also be discussed in this study.
To further illustrate the significance and difficulty of choosing appropriate kernels in genetic association testings, Figure 1A demonstrates an example based on the genotype data for the trait of whole brain volume collected from the ADNI project for the Alzheimer’s Disease Neuroimaging Initiative [Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu)]. As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf (https://adni.loni.usc.edu/) [22]. The group of genetic variants located in genes belonging to the caffeine metabolism pathway [23–25] are included into the KBT model to test the hypothesis that whether the caffeine metabolism pathway is associated with the whole brain volume phenotype. Using different kernel functions, the SKAT model leads to inconsistent conclusions. For instance, based on Quadratic kernel, the SKAT model rejects the null hypothesis (P-value<0.05), while the use of Gaussian kernel or IBS kernel does not lead to any rejection of the null hypothesis Figure 1A. On the other hand, using the equal-weighted average composite kernel, the SKAT model tends to reject the null hypothesis (P-value
). Since there is no clear mechanistic link between the caffeine metabolism pathway and whole brain volume, the rejection of the null hypothesis based on the Quadratic and the average composite kernels is likely a false discovery. To further quantify this issue, in Figure 1B, a cohort of total 500 replicate synthetic datasets based on the same covariates and genotype data is created, where the phenotype variables are generated by a Quadratic function
. Applying the SKAT test based on different kernel functions on the synthetic datasets, inconsistent testing results appear to be a persistent problem. Although the Quadratic kernel function leads to the correct hypothesis testing result as expected, the average composite kernel usually leads to incorrect conclusions (Figure 1B). As shown in the barplot of Figure 1B, using different kernels (Linear, Quadratic, Gaussian and average composite kernels) across the 500 synthetic datasets, inconsistent results are observed and the overall fractions of correct testing results are very low. Hence, the inconsistent conclusions based on different kernels suggest the fundamental need of developing a systematic data-adaptive approach of selecting appropriate kernel functions for KBT models in genome-wide association tests.
Figure 1.

Overview of the significance and model design for BayesKAT. (A) A real-world example of an association test with different pre-specified individual kernels(IBS, Gaussian or Quadratic) or the simple average of individual kernels (the Average kernel) leads to inconsistent results. To test the association between the caffeine metabolism pathway and the phenotype of whole brain volume, choosing different kernel functions (or the Average kernel) yield disparate P-values, introducing ambiguity in the final conclusion. In contrast, BayesKAT offers a more interpretable metric (the posterior probability of association) to draw the final conclusion. (B) In a synthetic data cohort simulated assuming a true quadratic function, different kernel functions lead to inconsistent results. Across 500 simulation replicates, each kernel, including the quadratic kernel, yields inconsistent and ambiguous conclusions (barplot). In comparison, BayesKAT generates both interpretable and highly consistent results, with substantially boosted power. (C) Workflow of BayesKAT implementation for diverse types of genetic association tests to derive biological meaningful interpretations. (D) Model structures and the inference algorithms for the two BayesKAT strategies: BayesKAT-MCMC (left) and BayesKAT-MAP (right). BayesKAT-MCMC samples from posterior parameter distributions, providing a comprehensive view of the posterior parameter distributions. On the other hand, BayesKAT-MAP provides a more scalable solution, particularly well-suited for high-dimensional data.
In this study, we developed a novel Bayesian Kernel-Based Association Testing algorithm, BayesKAT (https://github.com/wangjr03/BayesKAT). This algorithm effectively tackles the kernel selection challenge by choosing the optimal kernel in a data-adaptive way and calculating the posterior probability of association by evaluating the joint statistical associations of specific SNP groups with a complex phenotype. Moreover, compared with existing KBT-based methods, BayesKAT simultaneously achieves four goals in genome-wide association tests: (i) superior statistical power, by selecting the optimal kernel function based on the dataset under study; (ii) consistent results, by avoiding repeated tests based on a variety of different kernels; (iii) controlled type-1 error, without relying on unverified distributional assumptions or minimum P-value kernels and (iv) strong computational scalability for high-dimensional and large-sample genome-wide data. Two alternative computational strategies, i.e. MCMC and MAP, are incorporated in BayesKAT, leading to additional implementation flexibilities for users. Extensively tested on a series of simulated datasets under different parameter settings, BayesKAT consistently demonstrates superior performance against existing methods. Furthermore, applied on the ADNI genotype datasets of the complex trait of whole brain volume (https://adni.loni.usc.edu), BayesKAT successfully discovered mechanistically related genes and biological pathways with higher accuracy. Specific genes and pathways related with neurodegenerative diseases, as reported by previous studies, are consistently prioritized by BayesKAT while not prioritized by other methods. Strikingly, BayesKAT is able to identify group-level SNP effects of novel co-expressed gene modules and protein complexes that potentially participate in the molecular processes modulating the whole brain volume phenotype. These algorithmic advantages and new biological discoveries robustly support the statistical innovation of BayesKAT and strongly highlight its effectiveness in decoding the genetic basis and associated molecular mechanisms underlying complex diseases.
MATERIALS AND METHODS
Overview of KBT models for genetic data
Under the kernel machine regression framework, continuous quantitative traits can be associated with genetic variants or molecular features, along with additional covariates, through a semiparametric model:
![]() |
(1) |
where
denotes the continuous value of the trait for the
th person in a sample of size
;
is a set of
covariates for the
th individual that need to be controlled and
are the corresponding effects of covariates.
is the vector for the
genetic variants or molecular features, where
denotes the
th genetic variant or molecular level feature for the
th individual. The unknown errors
are assumed to be independent and follow
, where the value
is also unknown. The most common genetic features are SNPs and the widely used molecular-level features include gene expressions. The features, i.e.
are associated with the trait, i.e.
, through an arbitrary function
which is assumed to lie in a function space
generated by a kernel function
. The supplementary files provide More discussion about different kernel function types.
It has been shown [11] that the kernel machine regression model in equation (1) is equivalent to the following linear mixed model:
![]() |
(2) |
where
is a vector of effect sizes for covariates
,
is an
vector of random effects which is distributed as
, where
is the
kernel matrix and the error is distributed as
, where
is the error variance and
is a variance component for the genetic effect.
The main goal is to test if the genetic variants have any combined effect on the outcome variable
. Testing for the presence of group effect of
is equivalent to testing the hypothesis
versus
. Choosing an appropriate kernel is crucial and that topic has been discussed further in the supplementary files.
BayesKAT
Our new algorithm, BayesKAT (https://github.com/wangjr03/BayesKAT), employs a novel Bayesian modeling strategy to automatically select the optimal composite kernel based on the data and does not require the composite kernel function to be set a priori by the user. Based on the inferred optimal composite kernel function, BayesKAT can efficiently test the joint effects induced by a group of genetic or molecular features associated with a phenotype. The optimal composite kernel is a linear combination of candidate kernels where the weight of each candidate kernel reflects the degree of usefulness of the kernel explaining the complex relationship between a group of features and the phenotype of interest. As an illustration, suppose there are three potential kernels: Quadratic, Gaussian and IBS. If the IBS kernel effectively captures the underlying relationship, it will carry greater significance within the composite kernel, hence have a larger weight, while the impact of other kernels may be relatively weak, as indicated by their lower weights. Figure 1C provides an overview of the workflow of BayesKAT and its two computational strategies: (1) the Markov Chain Monte Carlo (MCMC) strategy; and (2) the Maximum a posteriori (MAP) strategy. Additionally, it is noteworthy that while BayesKAT is primarily developed to test genetic associations, it can also be employed for a wide range applications, including testing the association between continuous gene expression features and complex traits.
Consider a set of
candidate kernels
, the composite kernel is in the form of
, where
and
= 1 and
![]() |
Therefore, selecting the optimal composite kernel is equivalent to selecting the optimal value for the weight
so that it can capture the underlying relationship between the genetic or molecular features and the trait, when testing the group-level effect of a set of multiple features.
As a KBT model, BayesKAT relies on a set of candidate kernel functions, which are incorporated to infer the optimal composite kernel for the association tests. For the convenience of practical implementations, BayesKAT infers the optimal composite kernel consisting of three candidate kernels as the default setting. And the default candidate kernels include Quadratic, Gaussian and IBS kernel. To construct a composite kernel, the candidate kernels are normalized in BayesKAT based on the previously proposed technique [21] so that they are in the same scale and comparable.
To gain robust performance, weakly informative prior distributions are used for model parameters by default, although the users can incorporate more informative priors based on specific knowledge about the data. The important model parameters are
, where
are the unscaled weights of the candidate kernels (
1),
and
after reparameterization. And the weakly informative prior distributions are
![]() |
The actual weights for candidate kernels
. Clearly
and
. The data distribution is defined as
![]() |
(3) |
where the composite kernel
.
BayesKAT strategy
Here the main hypothesis to test is
versus
. It is equivalent to test
versus
. Bayes factor(
) is calculated to test the hypothesis, which evaluates the evidence in favor of the alternative hypothesis. Bayes factor is defined as the ratio of marginal likelihoods under two hypotheses [26]:
![]() |
(4) |
where
and
are the marginal likelihoods under
and
, respectively. Given the input data, BayesKAT mainly uses two efficient and easy-to-implement strategies to select the composite kernel and calculate the Bayes Factor
by estimating
and
. The two computational strategies are explained in subsequent sections.
Interpreting BayesKAT output
The preliminary output from BayesKAT,
, is a summary of evidence provided by the data in favor of
as opposed to
. In addition, the posterior probability of the association (
) is calculated as the final output, which has a one–one relation with
, i.e.
![]() |
Here
and
are the prior probabilities under
and
, respectively, i.e. the probabilities of existence of association and no association, respectively. Depending on the specific biological problem and dataset, the values of
and
can be set given biological evidence or prior knowledge, and
can be calculated.
is the posterior probability of the model under
given the data.
gives a quantitative evaluation of how probable there exists an association between the genetic features and the phenotype of interest or how strong the evidence is against the null hypothesis.
As already demonstrated by the example in Figure 1A and 1B in the introduction section, existing methods based on pre-selected kernels or average composite kernels suffer from inconsistent hypothesis testing results and may lead to false discoveries. In contrast, for the Figure 1A scenario, BayesKAT calculated the posterior probability of a true association given the data P(
)=0.19, i.e. the evidence of association is very low and the existence of true association is not very likely. This is consistent with the fact that there is a lack of documented evidence of the association between the caffeine metabolism pathway and the whole brain volume phenotype. Strikingly, in the scenario of the synthetic data cohort simulated with known association based on the Quadratic kernel (see Figure 1B), BayesKAT successfully calculated the posterior probability
to suggest the existence of association, without incorporating any prior information. Moreover, tested on 500 repeated simulation cohorts, BayesKAT achieves much higher statistical power than other methods and also demonstrates more consistent testing results (see Figure 1B). Investigating the inferred weights for each candidate kernel functions in the final composite kernel selected by BayesKAT further shows that the Quadratic kernel is correctly assigned with the largest weights (Supplementary Figure S1), suggesting that BayesKAT can efficiently capture the functional form of the underlying statistical associations in a data adaptive way.
BayesKAT-MCMC
As a Bayesian model, BayesKAT-MCMC employs the MCMC sampling-based strategy to infer the optimal composite kernel function. Leveraging MCMC for efficient and traceable samplings from complex target distributions, BayesKAT-MCMC avoids direct sampling from the posterior distribution
(see section 2.5), which does not have a closed mathematical form and is computationally intractable. Instead, Metropolis–Hastings method [27, 28] is used to iteratively draw samples based on the generated Markov chain, which are able to approximate the target probability distribution
.
Let
denote the initial value for
. The
th iteration of the Metropolis–Hastings algorithm consists of the following steps [29, 30]:
Sample a candidate point
from a proposal distribution
.Calculate the acceptance ratio for jumping to the new point

Set
with probability
and
with probability
. That is, it jumps to the new proposed value with probability
and stays at the same value with probability
.
Here is the main workflow for BayesKAT-MCMC:
Input: Genotype matrix
, covariate matrix
, response
. Parameters under
:
Parameters under
:
Prior distribution of
as mentioned in ‘BayesKAT’ section. Define the data distribution based on
,
,
and
, as defined in (3) where
0 or 1Step 1: Using the Metropolis–Hastings MCMC method mentioned above, draw samples from the posterior distribution of
using three separate MCMC chains, each of which ran 50 000 iterations.Step 2: check if the algorithm has converged, otherwise run more iterations.
Step 3: Draw samples from the posterior distribution of
by similarly Repeating steps 1 and 2.Step 4: Using the posterior samples of
, the unknown parameters are estimated:
. From
, the optimal kernel weights
can be estimated.Step 5: the posterior samples of
are used to calculate
, respectively, using Chib’s method[31].Step 6: Bayes Factor calculated from
using (4).
BayesKAT-MCMC uses the R package BayesianTools [32] to generate two sets of samples from the posterior distribution of
under the hypotheses
and
, using the Metropolis–Hastings algorithm in an adaptive way [33] to leverage the history of the stochastic process and appropriately fine-tune the proposal distributions. Three separate MCMC chains initiated from different random start points are generated for 50 000 iterations. To ensure that the MCMC chains are converged, trace plot is used to visualize the moves of the Markov chains in the state space [34]. In addition, based on the Gelman–Rubin diagnostic method [35], the potential scale reduction factor, i.e. PSRF score, is also calculated and presented at the end of MCMC sampling to inspect whether the chain is converged in which the PSRF score is close to one.
Based on the generated posterior samples, the marginal distributions of the parameters are further visualized as shown in Figure 1D. The posterior samples are used to estimate the composite kernel weights and also the marginal likelihoods
using the Chib’s method [31]. Bayes Factor is subsequently calculated using the formula presented in equation (4).
BayesKAT-MAP
Although BayesKAT-MCMC yields comprehensive information of the marginal distributions of model parameters, drawing large sets of samples from the posterior distributions in the MCMC strategy is computationally expensive. Here, we provide an alternative strategy, termed BayesKAT-MAP, which is easy to implement, allows parallel calculations and has higher computational scalability. Instead of drawing numerous MCMC samples from the posterior distributions and then estimating the parameter values, BayesKAT-MAP employs the quick optimization technique to estimate the parameters of interest directly, based on the MAP strategy such that
![]() |
(5) |
where
denotes the prior distribution
. Because the objective function is nondifferentiable at some points, a derivative-free optimization algorithm by Quadratic approximation using the R package Minqa [36] is implemented. The most important model parameter
indicates the existence of association between the feature set and the trait variable, with
suggesting that there is no association. In BayesKAT-MAP, if the calculated MAP estimator of
, i.e.
, it follows that the Bayes Factor =0 (i.e. no evidence of association is found), and the computational process terminates. On the other hand, if
, it implies that there might be some evidence of association and BayesKAT-MAP proceeds to calculate the MAP estimator again under
and then computes the marginal likelihoods
, along with the Bayes Factor.
Due to the practical limitations of exact analytical methods, such as relying on specific distributional assumptions, efficient numerical integration approaches [37] are needed to calculate the marginal likelihoods under hypotheses
,
= 

, so that the model can be applied on diverse panels of data. BayesKAT-MAP employs the Laplace’s method [26, 38, 39] for approximating the integral
by
, where
and
is the dimension of
,
is the mode of the log-likelihood function
and
is the inverse of the negative Hessian matrix of the second derivative of
computed at
. For boundary regions of the parameter space, the Laplace approximation is modified according to the previously developed protocol [40] to accommodate the boundary cases. Based on the estimated marginal likelihood densities, the Bayes Factor is then computed and the posterior probabilities are finally inferred, given user-defined priors
and
for which non-informative priors are used as the default setting in BayesKAT.
Here is the summary of the workflow for BayesKAT-MAP:
Input: Genotype matrix
, covariate matrix
, response
. Parameters under
:
Parameters under
:
Prior distribution of
as mentioned in ‘BayesKAT’ section. Define the data distribution based on
,
,
and
as mentioned in (3), where
0 or 1Step 1:
is calculated using this formulation 5 using optimization technique.Step 2: check if
. If 0, stop. conclusion: no association! If >0, go to the next step.Step 3: Calculate
using same technique in step 1 under
.Step 4: Calculate
using Laplace approximation(look at previous paragraph) based on
and
.Step 5: Bayes Factor calculated from
using (4).
The performance and runtime of BayesKAT-MCMC and BayesKAT-MAP under different settings are systematically compared and further discussed in supplementary files. Because of the superior computational scalability of BayesKAT-MAP, as shown in 1D, the results in the paper are generated using BayesKAT-MAP.
Input data organization and model setup for BayesKAT
Data containing the information of genotypes or molecular features for complex phenotypes can be collected from large public-accessible or user-generated cohorts (e.g. ADNI, UK biobank, All of Us (https://allofus.nih.gov/), GTEx, PsychENCODE (https://psychencode.synapse.org/), etc.). Individual-level data can be pre-processed and efficiently undergo steps of quality controls using software such as Plink [41]. Additionally, biological meaningful feature groups need to be defined and created depending on the goals of genome-wide association tests. In this study, we have explored four different biology inspired ways of grouping functionally related genetic variants, including (1) gene-wise groups: aggregating SNPs located within genomic regions of genes; (2) pathway-level groups: aggregating SNPs situated within genes that belong to a specific molecular pathway; (3) co-expression gene modules: aggregating SNPs located within genes that belong to a specific co-expression module and (4) Protein–protein interaction(PPI) modules: aggregating SNPs located within genes that belong to a specific PPI module, which may represent a protein complex.
Comparison with other methods and performance evaluation
The performance of BayesKAT is compared with two state-of-the-art algorithms: (1) SKAT using the average composite kernel, denoted as SKAT(Avg) [18]; and (2) the U statistic-based method, denoted as mKU [21]. Both of these two methods are frequentist approaches that maximize the power after restricting the type 1 error to a fixed level of
, such as setting
to 0.05, and have been shown to outperform other existing methods. In contrast, as the first Bayesian model for this problem, BayesKAT uses a fixed threshold on the posterior probability or the Bayes factor, based on previously suggested guidelines [26], to reject the null hypothesis. To make fair comparisons, the performance of each method (i.e. the empirical statistical power) is evaluated at a fixed and equal empirical type 1 error across all three algorithms. A systematic comparison based on rigorous simulations is presented in the Results section. As multiple groups are simultaneously tested, a multiplicity correction technique is implemented, which is discussed in detail in the supplementary files.
SIMULATION STUDIES: ENHANCED EFFICACY OF BAYESKAT BENCHMARKED ON SIMULATION STUDIES
As defined in the Materials and Methods section, the rows of the feature matrix
correspond to
individuals and the columns correspond to the
features. Depending on the particular genetic association studies, the features may encompass discrete genetic characteristics, like alleles with values of 0, 1 or 2, or continuous molecular features, such as gene expressions. We have conducted simulations for both cases, with
corresponding to discrete or continuous features, under both low- and high-dimensional settings.
Simulation with continuous features
As shown in Figure 2A, the simulations based on continuous features are first conducted to evaluate the performance of BayesKAT using similar scenarios presented in [21]. With the specified parameters (
,
,
,
), the feature matrix
is simulated from a multivariate normal distribution with mean
and an AR(1) correlation matrix
where
. In the simulation, the correlation
is set to be 0.6. The covariate matrix
has one binary covariate generated from a Bernoulli (0.6) distribution and one continuous covariate generated from N(2,1). The covariate coefficients are set as
and the outcomes
are simulated based on model (1). The error term with variance
is added as the random noise. Three commonly used candidate kernels are incorporated in BayesKAT: Linear, Quadratic and Gaussian. Different functional forms of
are used to create different scenarios to test the performance. The empirical type 1 errors of each method are calculated based on the specific scenario where
, i.e. there is no association between
and
in the simulated data. The different scenarios are given below:
Figure 2.
Performance comparison based on simulations using continuous features. (A) Schematic summary of the data generation process for continuous molecular features or variables (e.g. gene expression features) and the demonstration of the implementation under various scenarios. (B) Performance comparison across different simulation settings with systematic performance evaluations, i.e. the empirical power versus empirical type 1 error, for SKAT(Avg), mKU and BayesKAT across different scenarios and parameter settings. When increasing
(no of features) while keeping other factors constant, all methods exhibit a slight decline in power, but BayesKAT consistently outperforms SKAT(Avg) and mKU. Additionally, as
(correlation coefficient between features) increases under other conditions fixed, BayesKAT also consistently surpasses SKAT(Avg) and mKU.
Scenario A:

Scenario B:

Scenario C:

In all the simulation scenarios, as shown in Figure 2B, the empirical power versus empirical type 1 error for BayesKAT is consistently better than that of SKAT(Avg) and mkU. Moreover, sensitivity analyses are conducted to evaluate the effects with different simulation parameters on the final performance of different algorithms. Notably, when the number of features
increases from 100 to 150, while keeping all other parameters fixed, BayesKAT consistently achieves superior empirical power compared with other methods (see Figure 2B). Similarly, when the correlation
increases from 0.6 to 0.8 with
fixed as 150, BayesKAT consistently demonstrates higher empirical power and outperforms other methods. Taken together, these simulation results demonstrate the robust superior performance of BayesKAT compared with SKAT(Avg) and mKU in various settings with continuous features.
Simulation with discrete features
The performance of different models on discrete SNP features is first evaluated based on simulations where randomly selected SNP groups are used as features. Because randomly selected SNPs are generally not functionally related, the overall effectiveness of all KBT models decreases as expected, with BayesKAT still showing improved empirical power compared with other methods (Supplementary Figure S2). Because the real-world implementations of KBT models for genetics studies usually focus on functionally related groups of SNPs, a more realistic strategy of simulating groups of discrete SNP features, instead of randomly selected unrelated SNPs, is employed to further benchmark the performance of BayesKAT (Figure 3A).
Figure 3.
Performance comparison based on simulations with discrete features. (A) Schematic summary of the data generation process for discrete genetic features (e.g. SNP variants) and the demonstration of the implementation under various scenarios. Starting with the original
matrix containing groups of SNPs from each of the three randomly selected pathways, a simulated SNP matrix
is generated, preserving the underlying interrelationships among the SNPs. Covariates (i.e. age, gender and education) are incorporated based on the data from ADNI. (B) Performance comparison was conducted across different simulation settings with sample size n=755 (same as the real data). Systematic performance evaluations are plotted, i.e. the curves of empirical power versus empirical type 1 error averaged over three simulated datasets for SKAT(Avg), mKU, and BayesKAT across different scenarios. The performance curves based on each individual simulated dataset can be found in Supplementary Figure S3.
To rigorously capture the underlying linkage disequilibrium structures of discrete features in real-world SNP data, the ADNI dataset is used as the basis for the simulations, where the covariate matrix
is created from the real covariates (e.g. age, gender and education) of the corresponding individuals in the dataset, and SNPs located in specific genes belonging to the selected KEGG pathways are included into the testing (see Materials and Methods). Three different KEGG pathways are randomly selected for performance evaluations. For each pathway, the groups of SNPs located in the corresponding gene members are identified. The ‘SNPknock’ package [42] is then used to simulate the knockoff SNP data, which maintains the structural dependency among SNPs in each group intact in the simulated knockoff SNPs (Figure 3A). The feature matrix
is constructed based on the simulated SNP data, with
and
ranging between 4000 and 5000. The outcome variable
is simulated based on three different scenarios. Each scenario corresponds to a different functional form
, that is,
Scenario D:

Scenario E:

Scenario F:

where
is the
th column of
corresponding to the
th SNP. For each group of pathway-level knockoff SNPs, 500 simulations are generated. By applying BayesKAT and the other methods on the set of simulations, the corresponding empirical type 1 error and empirical power are calculated accordingly. The summary of performance comparisons based on this extensive set of simulations is shown in Supplementary Figure S3. The empirical power and empirical type 1 error for each SNP group clearly demonstrates that BayesKAT robustly outperforms SKAT(Avg) and mKU, across different simulation scenarios and settings. The averaged performance over three pathway-level SNP groups can be found in Figure 3B. To demonstrate the robust superior performance of BayesKAT, the simulation study is repeated using different sample sizes, n=1000,1500. The comparison outcomes are illustrated in supplementary Figures S4 and S5. Remarkably, in addition to these consistent advantages, BayesKAT also achieves much lower empirical type 1 error across all simulation settings, when the suggested Bayes Factor threshold [26] is employed. It suggests that the associations between the SNP groups and the phenotype detected by BayesKAT exhibit a significantly higher level of reliability compared with other methods, a crucial attribute for genetic applications in complex diseases.
APPLICATION TO ADNI DATASETS:: BAYESKAT REVEALS NOVEL ASSOCIATED GENETIC BASIS OF COMPLEX TRAITS
To illustrate the novel biological insights generated by BayesKAT, the individual-level data, including the genotype, phenotype and demographic covariates, from the ADNI project (https://adni.loni.usc.edu) [22, 43] are used to conduct a series of group-level genetic association testings. Specifically, BayesKAT is used to test the group-wise associations between SNP sets and the complex phenotype of whole brain volume, based on the available information across 755 individuals in the ADNI cohort.
Real data preprocessing
In addition to a series of simulated datasets, real genetic datasets are used to evaluate the performance of BayesKAT and its derived biological discoveries. The data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public–private partnership led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging, positron emission tomography, other biological markers and clinical and neuropsychological assessments can be combined to measure the progression of mild cognitive impairment and early Alzheimer’s disease. Plink software [41] (http://pngu.mgh.harvard.edu/purcell/plink/) is used to pre-process the individual-level genotype data. Detailed information on data access, download and pre-processing steps can be found in the supplementary material.
Four biology-based strategies are used to create the feature groups of functionally related SNPs, leading to complementary biological insights into the genetic basis underlying the specific phenotype. The four SNP feature grouping strategies are: (1) Gene-wise SNP groups for 18 999 protein-coding genes in the human genome [44]. For each protein-coding gene, all SNPs located within +/-5KB of the gene body are collected as the gene-wise SNP feature set; (2) Pathway-wise SNP groups for 352 KEGG pathways [23–25]. For each pathway, gene-wise SNP groups for all genes belonging to the specific pathway are collected as the pathway-wise SNP feature set. The number of SNPs per pathway varies from 35 to 22 555, with a mean of 1721 SNPs. Figure 5A provides a schematic figure demonstrating the pathway-wise joint SNP testing procedure; (3) Co-expression gene module based SNP groups. A total of 41 co-expression gene modules are identified using the R package ‘WGCNA’ [45, 46] to find correlated gene co-expression clusters from expression data (available on the ADNI website). The co-expression gene modules have a different number of genes in them, which varies from 6 to 3361. SNPs within each module are extracted for further group-wise testing and (4) 401 PPI gene module based SNP groups. The PPI gene modules are previously created in [47] based on the topology of the PPI network. The number of genes in PPI modules ranges from 2 to 497.
Figure 4.

Functional validation of BayesKAT’s prioritized genes using orthogonal information. (A) Top-ranking genes prioritized by BayesKAT are strongly supported by previous literature of functional studies of brain-related diseases. (B) The selected genes by BayesKAT demonstrate higher fractions of overlapping meQTL’s CpG sites than the genes selected by SKAT(Avg) and mKU. The CpG sites of significant meQTLs from the brain tissues represent orthogonal molecular-level evidence in support of the gene’s functional involvement with whole brain volume.
BayesKAT prioritizes functionally related genes
To identify genes associated with the trait of whole brain volume, we conduct a gene-wise association test with gene-level SNPs (see Materials and Methods). BayesKAT prioritized 17 genes, whose posterior probability of association (
) is greater than 0.7. Figure 4A shows some examples of the prioritized genes, which have been suggested to be associated with brain or neurodegenerative disorders by previous seminal studies [48–50]. The whole list of the 17 prioritized genes and their corresponding posterior probability of associations are provided in Supplementary Table S2. Strikingly, as external evidence in support of BayesKAT’s prioritized genes, 12 out of the 17 genes (
) contain CpGs that have been found to be involved with significant meQTLs [51] in the human brain cortex. In contrast, the genes prioritized by SKAT(Avg) and mKU demonstrate much lower fractions of overlapping with CpGs linked to meQTLs (
and
respectively, Figure 4B). The higher proportion of prioritized genes containing CpGs offers molecular-level support for BayesKAT’s ability to uncover genes mechanistically linked to complex traits.
Figure 5.
Pathway-level association tests by BayesKAT prioritizes neurodegenerative disease related pathways. (A) Schematic representation illustrating the steps of pathway-level association tests. Sets of SNPs located within genes belonging to each of the 352 pathways are tested simultaneously for pathway-level associations with the phenotype of interest (the whole brain volume), and the multiple testing correction is also implemented.(B) The top-ranking pathways prioritized by (i) BayesKAT, (ii) mKU, and (iii) SKAT(Avg) demonstrate distinct enrichment with neurodegenerative disease associated pathways. Top 50 pathways are shown for fair comparison. The top-ranking pathways by BayesKAT are ranked by the posterior probabilities of the pathway-level associations. The top-ranking pathways by the frequentist methods, mKU and SKAT(Avg) are ranked by the -log
(P-values). The pathways highlighted in red are neurodegenerative disease related pathways. The red horizontal dashed line in each bar plot indicates the threshold used by each model for fair comparison (see Materials and Methods). The pie charts illustrate the proportion of the selected pathways (above model’s thresholds) that belong to the neurodegenerative disease pathways. BayesKAT notably exhibits enhanced prioritization of neurodegenerative disease pathways. Due to the issue of inflated P-values in mKU, pathways with P-values of 0 are assigned -log
(P-values)=30 for visualizations.
Biological pathways linked to neurodegenerative diseases are top-ranked by BayesKAT
To identify biological pathways that potentially modulate the whole brain volume trait, BayesKAT is used to analyze pathway-level SNP groups (see Materials and Methods, Figure 5A). The top 50 ranked KEGG pathways by each model are summarized in Figure 5B, where BayesKAT ranks the pathways based on decreasing posterior probability of association (
), while the mKU and SKAT(Avg) methods rank the pathways based on decreasing -log
(P-value). Interestingly, BayesKAT successfully prioritized most of the neurodegenerative disease related pathways with top ranks (Figure 5B). This is a strong mechanistic support to BayesKAT’s results, because the neurodegenerative diseases, including Alzheimer’s disease, Huntington’s disease, Amyotrophic lateral sclerosis and Parkinson’s disease, have been found to be strongly related to brain volume loss [52–55]. Based on a reasonable posterior probability threshold 0.7, there are 21 pathways identified by BayesKAT (Figure 5B, Supplementary Table S3), among which five pathways are associated with neurodegenerative diseases. In comparison, the mKU and SKAT(Avg) models prioritized large numbers of pathways (242 and 72, respectively), while only a small fraction of them are neurodegenerative disease-related pathways (6 and 5, respectively). For SKAT(Avg), these functionally related pathways are not even the top-ranked ones. As summarized in the corresponding pie charts in Figure 5B, BayesKAT achieves the highest efficiency in prioritizing important pathways and resulting in fewer false discoveries than the other methods. These results are also consistent with the larger type 1 errors of mKU and SKAT(Avg) observed from simulation analyses described above. Note that setting the posterior probability threshold is subjective, same as selecting type 1 error threshold or FDR threshold (0.05 or 0.01 or 0.1). Opting for a threshold greater than 0.7 (such as 0.9, 0.95 or 0.99) is also effective, as BayesKAT efficiently prioritizes the top pathways. Overall, the highly prioritized functional relevant pathways imply the novel biological insights that can be generated using BayesKAT.
To further evaluate the performance of BayesKAT in determining the optimal kernel weights, 10 randomly chosen pathways are used to test the pathway-level associations. The resulting composite kernels are compared with the results of using SKAT based on individual kernels separately. The corresponding -log
(P-values) metrics from SKAT using individual kernels are compared with the inferred kernel weights in the composite kernels from BayesKAT. As shown in Figure 6A, the high similarity between the two heatmaps indicates that BayesKAT can efficiently select the optimal composite kernel automatically from the data, without relying on prior knowledge or repetitively trying different individual kernels.
Figure 6.

Boosted association tests based on BayesKAT’s composite kernel reveals novel modules of genes and proteins linked to brain volume. (A) The inferred weights of individual kernels in BayesKAT’s composite kernel (right) recapitulate the strength of each kernel (-log
(P-values)) when each kernel is incorporated separately (left). Without relying on prior knowledge or repetitively testing different kernels separately, BayesKAT automatically infers the appropriate composite kernels to boost the group-level tests for different pathways. (B) Prioritized co-expression gene modules by BayesKAT (left) versus. SKAT(Avg) and mKU (right). The significance threshold of selection for each model is represented by the horizontal red dashed lines (see Materials and Methods). (C) The selected significant co-expression gene modules by BayesKAT demonstrate higher fractions of overlapping with significant SNPs from orthogonal GWAS studies, compared with the results from SKAT(Avg) and mKU. (D) The selected significant PPI modules by BayesKAT demonstrate higher fractions of overlapping with significant SNPs from orthogonal GWAS studies, compared with the results from SKAT(Avg) and mKU. (E) Prioritized PPI modules by BayesKAT (left) versus SKAT(Avg) and mKU (right). The significance threshold of selection for each model is represented by the horizontal red dashed lines (see Materials and Methods).
BayesKAT identifies trait-associated gene modules and protein complexes
To further demonstrate BayesKAT’s capability of revealing novel group-level associations to traits from specific sets of cooperative SNPs, two additional SNP grouping strategies are applied: (1) SNPs from co-expression gene modules; and (2) SNPs from PPI modules (see Materials and Methods). Applied on the SNP groups aggregated from co-expression gene modules, BayesKAT is able to pinpoint specific modules as significantly associated with the whole brain volume trait (Figure 6B Left). On the other hand, SKAT(Avg) and mKU identify a large number of modules (Figure 6B Right), which are consistent with the inflated type 1 errors of these two methods as observed previously, and failed to provide specific prioritizations of the gene modules. Remarkably, comparing with the significant GWAS SNPs identified from another genome-wide meta-analysis of brain volume study [56], two out of the four selected modules by BayesKAT (
) contain significant GWAS SNPs (Figure 6C). In contrast, only
(7/27) and
(6/14) modules selected by mKU and SKAT(Avg) contain significant GWAS SNPs. These results further provide orthogonal support for the superior performance of BayesKAT in identifying new collective associations to the complex traits for SNP groups of functionally related genes.
When applied to SNP groups organized according to PPI modules that largely represent protein complexes, BayesKAT distinctly prioritizes eight PPI modules as significant, six of which contain significant GWAS SNPs (
, Figure 6D, Figure 6E Left). On the contrary, only
(17/56) and
(13/20) of the modules selected by mKU and SKAT(Avg), respectively, contain significant GWAS SNPs (Figure 6D, Figure 6E Right). Taken together, the highly specific prioritizations of potential gene modules and protein complexes, along with the substantially improved justification from other GWAS SNPs, suggest that BayesKAT can facilitate novel discoveries of molecular components involved in complex traits and may pave the way for innovative approaches to disease treatments.
DISCUSSION
BayesKAT (https://github.com/wangjr03/BayesKAT) is a data-adaptive methodology that automatically selects the appropriate composite kernel using the MCMC algorithm (BayesKAT-MCMC) or the optimization technique (BayesKAT-MAP) and conducts hypothesis testing on the presence of group-level genetic associations for complex traits. The Bayesian framework and the inferred posterior probabilities are more interpretable and informative, compared with P-values from frequentist methods. Based on extensive benchmark analyses, BayesKAT demonstrates consistent superior performance than other methods across different settings. Moreover, evaluated on a series of biologically inspired SNP groups based on a real genetic dataset, BayesKAT not only achieves improved prioritization of functionally relevant and justified group-level SNP associations, but also enables novel discoveries with respect to the underlying molecular mechanisms of complex traits. By revealing the collective effects of functionally cooperative SNPs without relying on the prior knowledge of specific kernels, BayesKAT represents one important step forward toward the goal of deciphering the intricate genetic basis of human diseases.
Although some methods based on the Gaussian process [57] or supervised learning technique [58] attempt to select the best kernel using training data for prediction purposes, BayesKAT is the first Bayesian KBT methodology that simultaneously selects the optimal composite kernel while testing for the associations, without requiring the training data. In addition, the data-adaptive strategy of composite kernel selections also facilitates the description of more complicated interdependence structures that cannot be fully captured by individual kernels. Furthermore, BayesKAT provides the flexibility of incorporating multiple testing corrections, integrating prior biological knowledge, and modeling various data types. To complement the MCMC strategy, BayesKAT-MAP is highly scalable and can be efficiently implemented for large-scale genome-wide studies.
BayesKAT utilizes the Metropolis–Hastings MCMC algorithm in combination with a derivative-free grid-search-based optimization approach to choose the composite kernel for specific datasets. Nevertheless, the BayesKAT framework is not restricted to these techniques. Other efficient MCMC sampling algorithms or reliable optimization techniques can be incorporated. A variety of sampling techniques have been reviewed and compared for different purposes of modeling [59–62]. Integrating these techniques, especially the variational Bayes techniques, into the BayesKAT framework and systematically evaluating their performance for different types of applications will be an important step for future developments (More discussion on this topic is included in the supplementary files).
Key Points
Bayesian model to automatically select the optimal composite kernels in a data-adaptive way.
BayesKAT consistently boosts the performance of genome-wide genetic association tests for complex diseases.
BayesKAT captures the group-level joint effects of cooperative SNPs.
Implementation of BayesKAT leads to new discoveries of disease-associated SNPs and the underlying molecular mechanisms.
BayesKAT facilitates the mechanistic insights (e.g. genes, pathways, gene modules and protein complexes) into complex phenotypes.
Supplementary Material
ACKNOWLEDGEMENTS
We thank MSU iCER for providing the high-performance computing infrastructure.
Contributor Information
Sikta Das Adhikari, Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA; Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.
Yuehua Cui, Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA.
Jianrong Wang, Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.
FUNDING
This work was supported, in part, by awards R01GM131398 from the National Institutes of Health and NSF1942143 from the National Science Foundation. Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative(ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
DATA AVAILABILITY
BayesKAT is an open-source infrastructure available in the GitHub repository https://github.com/wangjr03/BayesKAT. The repository includes R codes for BayesKAT, along with comprehensive instructions, sample testing data and code for pre-processing real data. Please note that the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data used for this manuscript are not publicly available, and interested users can request for access through the official portal: https://adni.loni.usc.edu/data-samples/access-data/. Detailed information on data download and processing procedures can be found in the supplementary material.
References
- 1. Bertram L, Tanzi RE. Thirty years of alzheimer’s disease genetics: the implications of systematic meta-analyses. Nat Rev Neurosci 2008;9(10):768–78. [DOI] [PubMed] [Google Scholar]
- 2. Vyse TJ, Todd JA. Genetic analysis of autoimmune disease. Cell 1996;85(3):311–8. [DOI] [PubMed] [Google Scholar]
- 3. Uffelmann E, Huang QQ, Munung NS, et al.. Genome-wide association studies. Nat Rev Methods Primers 2021;1. [Google Scholar]
- 4. Boyle EA, Li YI, Pritchard JK. An expanded view of complex traits: from polygenic to omnigenic. Cell 2017;169(7):1177–86. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Lander ES. Initial impact of the sequencing of the human genome. Nature 2011;470(7333):187–97. [DOI] [PubMed] [Google Scholar]
- 6. Franke A, McGovern DPB, Barrett JC, et al.. Genome-wide meta-analysis increases to 71 the number of confirmed crohn’s disease susceptibility loci. Nat Genet 2010;42(12):1118–25. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Voight BF, Scott LJ, Steinthorsdottir V, et al.. Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat Genet 2010;42(7):579–89. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Furlong LI. Human diseases through the lens of network biology. Trends Genet 2012;29(3):150–9. [DOI] [PubMed] [Google Scholar]
- 9. Chakravarti A, Turner TN. Revealing rate-limiting steps in complex disease biology: the crucial importance of studying rare, extreme-phenotype families. Bioessays 2016;38(6):578–86. [DOI] [PubMed] [Google Scholar]
- 10. Kwee LC, Liu D, Lin X, et al.. A powerful and flexible multilocus association test for quantitative traits. Am J Hum Genet 2008;82(2):386–97. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Liu D, Lin X, Ghosh D. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics 2007;63(4):1079–88. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Wu MÂC, Lee S, Cai T, et al.. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 2011;89(1):82–93. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Li S, Cui Y. Gene-centric gene-gene interaction: a model-based kernel machine method. Ann Appl Stat 2012;6(3):1134–61. [Google Scholar]
- 14. Liu D, Ghosh D, Lin X. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinformatics 2008;9:1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Marceau R, Wenbin L, Holloway S, et al.. A fast multiple-kernel method with applications to detect gene-environment interaction. Genet Epidemiol 2015;39(6):456–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Wu MC, Kraft P, Epstein MP, et al.. Powerful snp-set analysis for case-control genome-wide association studies. Am J Hum Genet 2010;86(6):929–42. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Seunggeun Lee, Zhangchen Zhao, with contributions from Larisa Miropolsky, and Michael Wu. SKAT: SNP-Set (Sequence) Kernel Association Test, 2023. R package version 2.2.5.
- 18. Wu MC, Maity A, Lee S, et al.. Kernel machine snp-set testing under multiple candidate kernels. Genet Epidemiol 2013;37(3):267–75. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Wessel J, Schork NJ. Generalized genomic distance–based regression methodology for multilocus association analysis. Am J Hum Genet 2006;79(5):792–806. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Lin X, Cai T, Wu MC, et al.. Kernel machine snp-set analysis for censored survival outcomes in genome-wide association studies. Genet Epidemiol 2011;35(7):620–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. He T, Li S, Zhong P-S, Cui Y. An optimal kernel-based u-statistic method for quantitative gene-set association analysis. Genet Epidemiol 2019;43(2):137–49. [DOI] [PubMed] [Google Scholar]
- 22. The ADNI team . ADNIMERGE: Alzheimer’s Disease Neuroimaging Initiative, 2023, R package version 0.0.1.
- 23. Kanehisa M, Goto S. Kegg: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000;28(1):27–30. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Kanehisa M. Toward understanding the origin and evolution of cellular organisms. Protein Sci 2019;28(11):1947–51. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Kanehisa M, Furumichi M, Sato Y, et al.. Kegg for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res 2023;51(D1):D587–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Kass RE, Raftery AE. Bayes factors. J Am Stat Assoc 1995;90(430):773–95. [Google Scholar]
- 27. Metropolis N, Rosenbluth AW, Rosenbluth MN, et al.. Equation of state calculations by fast computing machines. J Chem Phys 2004;21(6):1087–92. [Google Scholar]
- 28. Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970;57(1):97–109. [Google Scholar]
- 29. Gelman A, Carlin JB, Stern HS, Rubin DB. John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 1995. [Google Scholar]
- 30. Robert CP, Casella G, Casella G. Monte Carlo statistical methods, Vol. 2. New York: Springer, 1999. [Google Scholar]
- 31. Chib S, Jeliazkov I. Marginal likelihood from the metropolis–Hastings output. J Am Stat Assoc 2001;96(453):270–81. [Google Scholar]
- 32. Hartig F, Minunno F, Paul S. BayesianTools: General-Purpose MCMC and SMC Samplers and Tools for Bayesian Statistics, 2023, R package version 0.1.8.
- 33. Haario H, Saksman E, Tamminen J. An adaptive metropolis algorithm. Ther Ber 2001;7:223–42. [Google Scholar]
- 34. Roy V. Convergence diagnostics for markov chain Monte Carlo. Annu Rev Stat Appl 2020;7:387–412. [Google Scholar]
- 35. Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Stat Sci 1992;7:457–72. [Google Scholar]
- 36. Bates D, Mullen KM, Nash JC, Varadhan R. minqa: Derivative-free optimization algorithms by quadratic approximation. R package version. 2014;1(4). [Google Scholar]
- 37. Evans M, Swartz T. Methods for approximating integrals in statistics with special emphasis on bayesian integration problems. Statistical science 1995;10:254–72. [Google Scholar]
- 38. Tierney L, Kadane JB. Accurate approximations for posterior moments and marginal densities. J Am Stat Assoc 1986;81(393):82–6. [Google Scholar]
- 39. Kass RE. The validity of posterior expansions based on laplace’s method. Bayesian and likelihood methods in statistics and econometrics 1990;473–87. [Google Scholar]
- 40. Pauler DK, Wakefield JC, Kass RE. Bayes factors and approximations for variance component models. J Am Stat Assoc 1999;94(448):1242–53. [Google Scholar]
- 41. Purcell S, Neale B, Todd-Brown K, et al.. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007;81(3):559–75. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Sesia M, Sabatti C, Candès EJ. Gene hunting with hidden markov model knockoffs. Biometrika 2019;106(1):1–18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Saykin AJ, Shen L, Foroud TM, et al.. Alzheimer’s disease neuroimaging initiative biomarkers as quantitative phenotypes: genetics core aims, progress, and plans. Alzheimers Dement 2010;6(3):265–73. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Harrow J, Frankish A, Gonzalez JM, et al.. GENCODE: the reference human genome annotation for the ENCODE project. Genome Res 2012;22(9):1760–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 2005;4(1). [DOI] [PubMed] [Google Scholar]
- 46. Langfelder P, Horvath S. Wgcna: an r package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1): 1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Wang H, Huang B, Wang J. Predict long-range enhancer regulation based on protein–protein interactions between transcription factors. Nucleic Acids Res 2021;49(18):10347–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Munot P, Saunders DE, Milewicz DM, et al.. A novel distinctive cerebrovascular phenotype is associated with heterozygous arg179 acta2 mutations. Brain 2012;135(8):2506–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Nho K, Corneveaux J, Kim S, et al.. Whole-exome sequencing and imaging genetics identify functional variants for rate of change in hippocampal volume in mild cognitive impairment. Mol Psychiatry 2013;18:781–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. do Rosario MC, Bey GR, Nmezi B, et al.. Variants in the zinc transporter tmem163 cause a hypomyelinating leukodystrophy. Brain 2022;145(12):4202–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Ng B, White CC, Klein H-U, et al.. An xqtl map integrates the genetic architecture of the human brain’s transcriptome and epigenome. Nat Neurosci 2017;20(10):1418–26. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52. Sluimer JD, van der Flier WM, Karas GB, et al.. Whole-brain atrophy rate and cognitive decline: longitudinal mr study of memory clinic patients. Radiology 2008;248(2):590–8. [DOI] [PubMed] [Google Scholar]
- 53. Squitieri F, Cannella M, Simonelli M, et al.. Distinct brain volume changes correlating with clinical stage, disease progression rate, mutation size, and age at onset prediction as early biomarkers of brain atrophy in huntington’s disease. CNS Neurosci Ther 2009;15(1):1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Mezzapesa DM, Ceccarelli A, Dicuonzo F, et al.. Whole-brain and regional brain atrophy in amyotrophic lateral sclerosis. Am J Neuroradiol 2007;28(2):255–9. [PMC free article] [PubMed] [Google Scholar]
- 55. Burton EJ, McKeith IG, Burn DJ, et al.. Cerebral atrophy in parkinson’s disease with and without dementia: a comparison with alzheimer’s disease, dementia with lewy bodies and controls. Brain 2004;127(4):791–800. [DOI] [PubMed] [Google Scholar]
- 56. Jansen PR, Nagel M, Watanabe K, et al.. Genome-wide meta-analysis of brain volume identifies genomic loci and genes shared with intelligence. Nat Commun 2020;11:5606. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57. Teng T, Chen J, Zhang Y, Low BK. Scalable variational bayesian kernel selection for sparse gaussian process regression. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):5997–6004, 2020. [Google Scholar]
- 58. Gönen M, Alpaydin E. Multiple kernel learning algorithms. J Mach Learn Res 2011;12:2211–68. [Google Scholar]
- 59. Han C, Carlin BP. Markov chain Monte Carlo methods for computing bayes factors: a comparative review. J Am Stat Assoc 2001;96(455):1122–32. [Google Scholar]
- 60. DiCiccio TJ, Kass RE, Raftery A, Wasserman L. Computing bayes factors by combining simulation and asymptotic approximations. J Am Stat Assoc 1997;92(439):903–15. [Google Scholar]
- 61. Sinharay S, Stern HS. An empirical comparison of methods for computing bayes factors in generalized linear mixed models. J Comput Graph Stat 2005;14(2):415–35. [Google Scholar]
- 62. Friel N, Pettitt A. Marginal likelihood estimation via power posteriors. J R Stat Soc Series B 2008;70:589–607. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
BayesKAT is an open-source infrastructure available in the GitHub repository https://github.com/wangjr03/BayesKAT. The repository includes R codes for BayesKAT, along with comprehensive instructions, sample testing data and code for pre-processing real data. Please note that the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data used for this manuscript are not publicly available, and interested users can request for access through the official portal: https://adni.loni.usc.edu/data-samples/access-data/. Detailed information on data download and processing procedures can be found in the supplementary material.











