American Journal of Human Genetics. 2025 Jul 29;112(9):2198–2212. doi: 10.1016/j.ajhg.2025.07.004

Sparse modeling of interactions enables fast detection of genome-wide epistasis in biobank-scale studies

Julian Stamp 1,, Samuel Pattillo Smith 2,3,6, Daniel Weinreich 1,4, Lorin Crawford 5,∗∗
PMCID: PMC12461027  PMID: 40738108

Summary

The lack of computational methods capable of detecting epistasis in biobanks has led to uncertainty about the role of non-additive genetic effects on complex trait variation. The marginal epistasis framework is a powerful approach because it estimates the likelihood of a SNP being involved in any interaction, thereby reducing the multiple testing burden. Current implementations of this approach have failed to scale genome wide in large human studies. To address this, we present the sparse marginal epistasis (SME) test, which concentrates the scans for epistasis to regions of the genome that have known functional enrichment for a quantitative trait of interest. By leveraging the sparse nature of this modeling setup, we develop a statistical algorithm that allows SME to run 10–90 times faster than state-of-the-art epistatic mapping methods. In a study of complex traits measured in 349,411 individuals from the UK Biobank, we show that reducing searches of epistasis to variants in functionally enriched regions facilitates the identification of genetic interactions associated with regulatory genomic elements.

Keywords: epistasis, complex traits, linear mixed models, sparsity

Graphical abstract



The sparse marginal epistasis test overcomes computational limitations of previous mapping approaches by focusing its epistatic search on regions of the genome that have some known functional relationship with a trait. This strategy both significantly improves its statistical power and allows it to scale genome wide for large human biobank studies.

Introduction

Genome-wide association studies (GWASs) have identified thousands of genetic loci linked with various complex traits and common diseases, offering valuable insights into the genetic foundations of phenotypic variation.1 As of late, there have been many efforts to estimate proportions of genetic variance beyond what is attributable to additive effects.2,3,4,5,6 Epistasis, which refers to interactions between genetic loci, is thought to play a key role in constituting the genetic basis of evolution.7,8 While many studies have shown epistasis to be pervasive in model organisms,9,10,11 controversies remain with respect to its role in humans.12 For example, some epistatic interactions identified in association mapping studies can be explained by additive effects of unobserved variants.13 Though previous studies have shown that genetic variance is mainly additive,9,14 these conclusions have recently been challenged.5

Numerous statistical methods have been developed to identify single-nucleotide polymorphisms (SNPs) that contribute to epistasis. Traditional approaches focus on explicitly detecting significant interactions through exhaustive or probabilistic searches utilizing frequentist tests, Bayesian inference, and machine learning techniques.15,16,17,18 With advancements in sequencing technologies, many contemporary GWASs are conducted on biobank-scale datasets comprising hundreds of thousands of individuals genotyped at millions of markers and phenotyped for thousands of traits.1,19,20 This is crucial, as the effect of epistatic interactions is hypothesized to be small for many traits,12,14 and traditional search algorithms are known to be most powered when large training datasets are available.14,15 However, despite efficient computational improvements, exploring large combinatorial domains continues to pose a challenge for epistatic mapping studies. With a lack of a priori knowledge about which epistatic loci to prioritize, exploring all possible combinations of genetic variants can result in low statistical power after correcting for multiple hypothesis tests (e.g., there are J choose 2 possible pairwise combinations for a study with J SNPs).

As an alternative to traditional exhaustive search methods, the marginal epistasis framework was developed to estimate the combined pairwise interaction effects between a focal SNP and all other variants in the dataset. The “marginal epistasis test” (MAPIT) evaluates each SNP individually and identifies candidates involved in epistasis without requiring the identification of their exact interacting partners.2 Recently, the concept of marginal epistasis has been leveraged to estimate the contribution of non-additive heritability in complex traits using GWAS summary statistics.5 It has also been extended to explore the importance of genetic interactions across multiple traits simultaneously.3 Theoretically, MAPIT is formulated as a linear mixed model where the random effects and corresponding variance components are estimated using a method-of-moments (MoM) algorithm.21,22 Although MAPIT mitigates the reduction of power due to the multiple testing burden, its implementation on datasets with large sample sizes remains computationally intensive.2 Specifically, the computational complexity scales linearly with the number of SNPs and (at best) quadratically with the number of individuals, making it suitable for moderately sized GWAS applications but infeasible for biobank-scale studies.23,24 Efforts have been made to address this limitation, such as the “fast marginal epistasis test” (FAME),4 which leverages a stochastic MoM framework and introduces both computationally efficient stochastic trace estimators25 and innovative methods to expedite matrix multiplication.26 However, despite these advancements, further work is necessary to scale the method to genome-wide applications.

This work introduces the sparse marginal epistasis (SME) test, which focuses on searching for epistasis in regions of the genome with known functional enrichment27 related to a quantitative trait of interest. This method has two main advantages. First, it prioritizes candidate regions likely to involve epistatic gene action. Studies have indicated that variants in coding regions account for less than 10% of the phenotypic variance in many traits and diseases.28 Consequently, the remaining heritability is attributed to regions expected to play a regulatory role27,28,29 and that are active in trait-specific tissue.30,31 Second, the sparse nature of this approach leads to more efficient estimators for model parameters22,32 and allows SME to operate significantly faster than existing methods, such as MAPIT and FAME. Through detailed simulations, SME demonstrates effective type I error control and improved power compared to previous approaches. Furthermore, utilizing information from DNase I-hypersensitivity sites in ex vivo human erythroid differentiation33 and GWAS summary statistics, we use SME to analyze complex traits in individuals from the UK Biobank19 and identify genetic interactions associated with regulatory genomic elements.

Material and methods

The SME test

The SME test performs a genome-wide search for SNPs involved in genetic interactions while conditioning on information derived from functional genomic data (Figure 1A). Consider a GWAS with N individuals who have been genotyped for J SNPs encoded as {0,1,2} copies of a reference allele at each locus. Also assume that we have access to an external reference S that encodes some additional biological information about the quantitative trait being studied. The marginal epistasis test aims to identify genetic variants that are involved in epistasis without exhaustively searching over all possible interactions.2 By examining one SNP at a time (indexed by j), SME fits the following linear mixed model:

$$y = \mu + \sum_{l} x_l \beta_l + \sum_{l \neq j} (x_j \circ x_l)\,\alpha_l \cdot \mathbb{1}_S(w_l) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \tau^2 I)$$ (Equation 1)

where $y$ is an $N$-dimensional quantitative trait vector measured for each individual in the study; $\mu$ is an intercept term; $X$ is the $N \times J$ matrix of allele counts that have been column standardized across individuals, with $x_l$ representing an $N$-dimensional vector for the $l$-th SNP; $\beta_l$ is the additive effect for the $l$-th SNP; $x_j \circ x_l$ is the Hadamard (elementwise) product of the two genotypic vectors with corresponding interaction effect size $\alpha_l$; $\varepsilon$ is a normally distributed error term with mean zero and scale variance term $\tau^2$; and $I$ denotes an $N \times N$ identity matrix. The key to this formulation is that the inclusion of the interaction between the $j$-th and $l$-th SNPs is based on an indicator function

$$\mathbb{1}_S(w_l) = \begin{cases} 1 & \text{if } w_l \in S \\ 0 & \text{if } w_l \notin S, \end{cases}$$ (Equation 2)

where $w_l$ encodes information about the $l$-th SNP. For example, if testing for epistatic effects in red blood cell traits, we can incorporate information about regulatory regions during erythroid differentiation into the model (Figure 1B). In this case, $S$ could be a set of genomic regions for which DNase sequencing (DNase-seq) implicates chromatin accessibility and $w_l$ could encode the physical location of the $l$-th SNP on the genome. Here, $\mathbb{1}_S(w_l) = 1$ if the $l$-th SNP is located in one of these regions (i.e., $w_l \in S$). This means that while all SNPs are tested for marginal epistasis, only their interactions with SNPs included in the mask resulting from Equation 2 are considered.
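As an illustration of the indicator function in Equation 2, the following is a minimal sketch that assumes the external reference is a sorted list of non-overlapping, half-open genomic intervals on a single chromosome and that each SNP is described by its base-pair position. The function name `build_mask` and the toy coordinates are hypothetical and are not taken from the SME software.

```python
import bisect

def build_mask(positions, intervals):
    """Return the 0/1 indicator for each SNP position.

    positions: base-pair coordinates, one per SNP (single chromosome).
    intervals: sorted, non-overlapping half-open (start, end) regions (assumed format).
    """
    starts = [s for s, _ in intervals]
    mask = []
    for w in positions:
        # Index of the last interval whose start is <= w (or -1 if none).
        i = bisect.bisect_right(starts, w) - 1
        inside = i >= 0 and w < intervals[i][1]
        mask.append(1 if inside else 0)
    return mask

# Hypothetical toy example: two accessible regions, five SNPs.
regions = [(100, 200), (500, 650)]
snp_positions = [50, 150, 200, 510, 700]
mask = build_mask(snp_positions, regions)
```

Only SNPs whose indicator equals 1 would contribute interaction terms with a focal SNP; all SNPs are still tested as focal SNPs.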

Figure 1.


Schematic overview of the sparse marginal epistasis test

(A) Sparse marginal epistasis (SME) examines one SNP at a time and estimates marginal epistatic effects: the combined pairwise interaction effects between a $j$-th focal SNP and other variants on the genome (indexed by $l \neq j$). The key to SME is that it incorporates genomic data $S$ through a binary indicator function $\mathbb{1}_S(w_l)$, where $w_l$ provides information about the $l$-th background SNP. This creates a mask to only search for interactions in regions of the genome with known functional enrichment related to a trait of interest.

(B) As an example, let data on DNase I-hypersensitive sites (DHS) be used for $S$. In this case, SME restricts the marginal epistasis test to assessing interactions between each focal SNP and variants with genomic coordinates that fall within open chromatin and regulatory regions. The DNase-seq signal is converted into a binary mask, excluding variants located in regions with closed chromatin (i.e., variants with coordinates $w_l \notin S$).

(C) SME tests every SNP genome wide. Masking results in improved power to detect marginally epistatic variants versus the traditional non-masking approach.

(D) SME uses computationally efficient estimators to enable genome-wide testing on biobank-scale datasets. It achieves runtimes 10×–90× faster than current state-of-the-art methods.

The benefits of this sparse approach are 2-fold. First, by limiting the search to regions of the genome that are most likely to be functionally associated with a phenotype, SME produces significantly more efficient estimators, which leads to an increase in power (Figure 1C). Second, by masking out sets of variants for each test on a focal SNP, SME leverages a fast MoM algorithm to substantially improve its scalability for genome-wide analyses (Figure 1D). Specifically, SME introduces an approximation to the efficient stochastic trace estimator,4,23,25,34 which allows the algorithm to avoid repeating costly matrix computations when estimating model parameters across each SNP that is being tested (Figures S1–S3).

Variance component model formulation

For biobank-scale data, there are often more SNPs than individuals. To overcome an underdetermined system in Equation 1, the marginal epistasis framework assumes that the effect sizes follow univariate normal distributions where $\beta_l \sim \mathcal{N}(0, \omega^2/J)$ and $\alpha_l \sim \mathcal{N}(0, \sigma^2/J^*)$, with $J^* = \sum_{l \neq j} \mathbb{1}_S(w_l)$ representing the number of interactions considered in the model.2,32,35,36,37 Assuming that the phenotype has been mean centered and scaled, these normal assumptions allow Equation 1 to be rewritten as

$$y = m + g_j + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \tau^2 I),$$ (Equation 3)

where $m = \sum_{l} x_l \beta_l$ is the combined additive effects from all SNPs and $g_j = \sum_{l \neq j} (x_j \circ x_l)\,\alpha_l \cdot \mathbb{1}_S(w_l)$ represents the effects of a subset of pairwise interactions involving the $j$-th SNP.

Probabilistically, Equation 3 translates to SME assuming that $m \sim \mathcal{N}(0, \omega^2 K)$, where the covariance matrix $K = X X^\top / J$ accounts for the relatedness between individuals in the data and the corresponding component $\omega^2$ models the phenotypic variance explained (PVE) by additive effects. The second term can be written as $g_j \sim \mathcal{N}(0, \sigma^2 G_j)$, where $G_j = D_j X_{-j} W_j X_{-j}^\top D_j / J^*$, with $X_{-j}$ denoting the genotype matrix without the $j$-th SNP and $D_j = \mathrm{diag}(x_j)$ representing an $N \times N$ diagonal matrix. Importantly, $W_j = \mathrm{diag}[\mathbb{1}_S(w_1), \ldots, \mathbb{1}_S(w_{J-1})]$ has binary diagonal elements that only equate to 1 if the $l$-th SNP satisfies the criteria from the external data source $S$. Altogether, these results show that the covariance matrix $G_j$ represents all pairwise interactions involving the $j$-th SNP that have not been masked out according to the set of indicator functions $\{\mathbb{1}_S(w_l)\}_{l \neq j}$. The main takeaway from the variance component formulation of SME is that the term $\sigma^2$ measures SNP-specific contribution to the non-additive genetic variance.
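The two covariance matrices in this formulation can be written out directly for a toy dataset. The sketch below is illustrative only: it forms dense matrices, which is exactly what the scalable algorithm avoids, and the sizes, the random genotypes, and the mask values are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, j = 50, 8, 2                      # toy sizes; j is the focal SNP index

A = rng.integers(0, 3, size=(N, J)).astype(float)   # allele counts in {0, 1, 2}
X = (A - A.mean(axis=0)) / A.std(axis=0)             # column-standardized genotypes

mask = np.array([1, 0, 1, 1, 0, 0, 1, 0])            # hypothetical indicator values
keep = np.arange(J) != j                             # drop the focal SNP
w = mask[keep].astype(float)                         # diagonal of the masking matrix
J_star = w.sum()                                     # number of modeled interactions

K = X @ X.T / J                                      # additive covariance matrix
X_mj = X[:, keep]                                    # genotypes without the focal SNP
xj = X[:, j]                                         # diagonal of the focal-SNP matrix
# Interaction covariance: rows and columns scaled by xj via broadcasting,
# which avoids building N x N diagonal matrices explicitly.
G_j = (xj[:, None] * (X_mj * w)) @ (X_mj * w).T * xj[None, :] / J_star
```

Because the mask is binary, scaling columns by it is the same as dropping the masked columns, so the interaction covariance equals the scaled sum of outer products of the unmasked Hadamard products with the focal SNP.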

Point estimates and hypothesis testing

The model in Equation 3 has three variance components that can be estimated using a computationally efficient MoM algorithm.22 In expectation,

$$\mathbb{E}[y^\top A y] = \sum_{k=1}^{3} \mathrm{tr}(A \Sigma_{jk})\,\delta_k,$$ (Equation 4)

where $A$ is a symmetric and non-negative definite matrix used to create weighted second moments, $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, and we use the shorthand $[\Sigma_{j1}; \Sigma_{j2}; \Sigma_{j3}] = [K; G_j; I]$ and $\delta = (\omega^2, \sigma^2, \tau^2)$, respectively. In practice, we replace the left-hand side of Equation 4 with the realized value $y^\top A y$. We also use the realized covariance matrices in place of the arbitrary $A$. The point estimates for each variance component are then given as

$$\hat{\delta}_{jk} = y^\top H_{jk}\, y,$$ (Equation 5)

where $H_{jk} = \sum_{t=1}^{3} (S_j^{-1})_{kt} \Sigma_{jt}$ and $S_j$ is a $3 \times 3$ matrix with elements $(S_j)_{kt} = \mathrm{tr}(\Sigma_{jk} \Sigma_{jt})$.

SME tests for non-zero marginal epistasis using a one-sided Z score or normal test. This is equivalent to assessing the null hypothesis $H_0: \sigma^2 = 0$ for each SNP in the data. We derive a test statistic with the estimate $\hat{\sigma}_j^2$ using Equation 5 and compute an approximate standard error

$$\mathbb{V}[\hat{\sigma}_j^2] \approx 2\, y^\top H_j^\top V_j H_j\, y,$$ (Equation 6)

where $V_j = \hat{\omega}_j^2 K + \hat{\sigma}_j^2 G_j + \hat{\tau}_j^2 I$. Note that the point estimates from Equations 4 and 5 are unbiased and can lead to negative values when the true variance component is zero.22 The one-sided hypothesis test formalizes the constraint that only positive estimates of $\hat{\sigma}_j^2$ can be indicative of marginal epistasis.
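To make Equations 4 through 6 concrete, the following is a minimal dense sketch of the moment estimator and the one-sided normal test. It assumes small matrices held in memory and exact traces; the function name `mom_test` and the handling of a non-positive variance estimate (returning NaN) are assumptions made for illustration, not behavior documented for SME.

```python
import math
import numpy as np

def mom_test(y, K, G):
    """MoM estimates (omega^2, sigma^2, tau^2) plus z score and one-sided p value for sigma^2."""
    N = y.shape[0]
    Sigmas = [K, G, np.eye(N)]
    # (S)_{kt} = tr(Sigma_k Sigma_t); for symmetric matrices this is the Frobenius inner product.
    S = np.array([[np.sum(a * b) for b in Sigmas] for a in Sigmas])
    q = np.array([y @ a @ y for a in Sigmas])           # realized second moments y^T Sigma_k y
    delta = np.linalg.solve(S, q)                       # equals y^T H_k y from Equation 5
    S_inv = np.linalg.inv(S)
    H = sum(S_inv[1, t] * Sigmas[t] for t in range(3))  # H for the epistatic component
    V = sum(d * a for d, a in zip(delta, Sigmas))       # plug-in covariance of y
    Hy = H @ y
    var = 2.0 * Hy @ V @ Hy                             # Equation 6 (H is symmetric)
    z = delta[1] / math.sqrt(var) if var > 0 else float("nan")
    p = 0.5 * math.erfc(z / math.sqrt(2.0))             # upper-tail normal p value
    return delta, z, p
```

Solving the 3×3 linear system is algebraically the same as evaluating the three quadratic forms in Equation 5, and it makes clear that only traces of matrix products and quadratic forms in the phenotype are needed.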

Scalable computation via stochastic MoM

The right-hand side of Equation 5 involves computing traces of matrix products. If each covariance matrix is held in memory, then this can be done efficiently (without matrix multiplication) using the Frobenius inner product. However, holding large covariance matrices in memory itself prevents scalability of the method. Naively multiplying two $N \times N$ matrices requires $N^3$ field operations. This too can be impractical for biobank-scale data with large sample sizes. To enable genome-wide testing, SME makes use of a stochastic MoM approach through the implementation of Hutchinson’s stochastic trace estimator and the Mailman algorithm. The stochastic trace implementation enables block-wise processing of genotype data. With this approach, SME computes the traces of all covariance matrix products without ever needing to explicitly estimate the covariance matrices themselves. By making the block size (i.e., the number of SNPs processed simultaneously) configurable, the computation can be performed using as little as 1 gigabyte (GB) of random-access memory (RAM).

Hutchinson’s stochastic trace estimator

Hutchinson’s stochastic trace estimator approximates the trace of a matrix product via the following4,23,25,34:

$$\mathrm{tr}(\Sigma_{jr} \Sigma_{js}) \approx \frac{1}{B} \sum_{b=1}^{B} z_b^\top \Sigma_{jr} \Sigma_{js}\, z_b,$$ (Equation 7)

where $z_b \sim \mathcal{N}(0, I)$ is a normally distributed vector and $B$ is the number of random draws used to approximate the trace. This operation only depends on a series of matrix-by-vector products and has time complexity $O(BNJ)$. Essentially, we choose an order of operations such that computing the quadratic forms $z_b^\top X X^\top X X^\top z_b = \|X X^\top z_b\|^2$ is reduced to (1) applying $X^\top$ to the vector $z_b$ and then (2) applying $X$ to the resulting vector $X^\top z_b$ for all $B$ random vectors. Importantly, Equation 7 can be set up algorithmically such that the approximation is done block-wise over the individual-level genotype data. Using this approach, SME avoids having to compute any of the covariance matrices directly and alleviates the need to load the entire genotype matrix into memory all at once.
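The order of operations described above can be sketched for the additive term, where both covariance matrices in the product equal the genetic relatedness matrix. This is a minimal illustration that assumes the whole standardized genotype matrix fits in memory (no block-wise processing); the function name `hutchinson_trace_K2` is hypothetical.

```python
import numpy as np

def hutchinson_trace_K2(X, B, rng):
    """Estimate tr(K^2) with K = X X^T / J using only matrix-vector products."""
    N, J = X.shape
    total = 0.0
    for _ in range(B):
        z = rng.standard_normal(N)      # random probe vector with identity covariance
        Kz = X @ (X.T @ z) / J          # (1) apply X^T, then (2) apply X; K is never formed
        total += Kz @ Kz                # z^T K K z = ||K z||^2 because K is symmetric
    return total / B
```

Each probe costs two passes over the genotype matrix, so the total work grows with the number of probes times the size of the genotype matrix rather than with the square or cube of the sample size.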

The Mailman algorithm

An additional computational speedup can be achieved by making use of the discrete encoding for each SNP. The Mailman algorithm allows for an $N \times J$ matrix to be multiplied by any real vector in $O(NJ / \log_{\Omega} \max(N, J))$ time if it has elements defined over a finite alphabet of size $\Omega$.4,23,26,34 A standardized genotype matrix can be written as $X = (A - U) Q^{-1}$, where $A$ is an $N \times J$ allele count matrix with elements $a_{ij} \in \{0, 1, 2\}$ over a finite alphabet of size $\Omega = 3$, $U = [u_1, \ldots, u_J]$ is a matrix where the $j$-th column contains the sample mean for the $j$-th SNP, and $Q$ is a diagonal matrix with the standard deviation of each SNP as the diagonal entries. With this specification, we can write $X^\top z_b = Q^{-1} (A - U)^\top z_b$. The first term, $Q^{-1} A^\top z_b$, can be solved in $O(NJ / \log_3 \max(N, J))$ time. The second term, $Q^{-1} U^\top z_b$, corresponds to scaling the random $N$-dimensional vector $z_b$, which can be computed in time $O(N + J)$.4,23,34
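The split into a discrete-matrix product and a cheap correction term can be checked numerically. The sketch below does not implement the Mailman algorithm itself; it only shows the decomposition that lets the expensive product act on the raw allele counts, with an ordinary matrix-vector product standing in where a Mailman routine would go. The function name and the use of the per-SNP standard deviation as the scaling are assumptions.

```python
import numpy as np

def standardized_rmatvec(A, z):
    """Compute X^T z for column-standardized X without forming X."""
    mu = A.mean(axis=0)            # per-SNP sample means
    sd = A.std(axis=0)             # per-SNP scale; assumes no monomorphic SNPs
    Az = A.T @ z                   # product with the discrete {0, 1, 2} matrix
    Uz = mu * z.sum()              # mean-correction term: each entry is mean_j * sum(z)
    return (Az - Uz) / sd
```

Only the first product touches the full genotype matrix; the correction needs one sum over the probe vector and one scaling per SNP.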

Shared random vectors and parallelization

With the stochastic trace estimator and the Mailman algorithm, it is feasible to estimate the variance components for each focal SNP even when a study has a large number of individuals. Still, testing every focal SNP against all variants genome wide remains a challenge. In SME, we propose randomly selecting subsets of focal SNPs and having them share the same random vectors zb when performing the stochastic trace estimation. This limits the number of computations that need to be performed while maintaining unbiasedness in the point estimates (Figure S4).

Since the error terms in Equation 3 are assumed to be independent, the only two intermediate products that need to be computed in Equation 7 are $K z_b$ and $G_j z_b$. With these terms, we can compute all combinations of traces of matrix products that are required to fit SME (e.g., $\mathrm{tr}(K^2) \approx \|K z_b\|^2$ and $\mathrm{tr}(K G_j) \approx z_b^\top K G_j z_b$). For a subset of $L$ focal SNPs and fixed $z_b$, the term $K z_b$ is constant, and only $G_j z_b$ changes. This reduces the effective time needed to compute $K z_b$ per test by a factor of $1/L$. Figure S1 illustrates the idea of sharing random vectors.
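A minimal sketch of sharing random vectors across a group of focal SNPs is given below. It assumes standardized genotypes held in memory, a binary mask over SNPs, and probe vectors stored as the columns of a matrix; the function name `shared_vector_products` is hypothetical and parallelization is omitted.

```python
import numpy as np

def shared_vector_products(X, mask, focal, Z):
    """Return K Z (computed once) and {j: G_j Z} for each focal SNP j, sharing the columns of Z."""
    N, J = X.shape
    KZ = X @ (X.T @ Z) / J                      # shared across all focal SNPs
    GZ = {}
    for j in focal:
        w = mask.astype(float).copy()
        w[j] = 0.0                              # the focal SNP does not interact with itself
        Xw = X[:, w > 0]                        # masking is equivalent to dropping columns
        d = X[:, [j]]                           # focal genotype as an N x 1 column
        GZ[j] = d * (Xw @ (Xw.T @ (d * Z))) / w.sum()
    return KZ, GZ
```

With these products, a trace such as the one involving both the additive and interaction covariance matrices could be approximated as `np.mean(np.sum(KZ * GZ[j], axis=0))`, and the cost of the shared additive product is paid once per group rather than once per focal SNP.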

As previously mentioned, reading biobank-scale genotype data into memory requires non-negligible overhead. The R implementation of SME reads in genotypes once for each subset of focal SNPs that share the same random vectors (web resources). The computation of Kzb and each Gjzb can then be done in parallel using multithreading. While sharing random vectors helps accelerate this task, it also requires storing larger quantities of intermediate results in memory.

Masking further reduces the effective size of data

A direct benefit of masking is that it is equivalent to removing entire columns from the genotype matrix. This reduction contributes to a significantly faster runtime when computing $G_j z_b$ (e.g., Figures S2 and S3). Concretely, applying a mask to an $N \times J$ genotype matrix reduces the number of columns to $J^* \leq J$. If the mask is sparse enough such that $J^* < N$, the time complexity of the Mailman algorithm is also reduced to $O(BNJ^* / \log_3 N)$.

Preprocessing the UK Biobank

Genotype data for 488,377 individuals in the UK Biobank were downloaded and converted using the ukbgene and ukbconv tools, respectively. Continuous traits were also downloaded using the ukbgene tool and were adjusted for age and sex. Individuals identified as having high heterozygosity, excessive relatedness, or aneuploidy were removed (1,550 individuals). After separating individuals into self-identified ancestral cohorts using data field 21000, unrelated individuals were selected by randomly choosing one person from each related pair. This resulted in N= 349,411 White British individuals to be included in our analysis. We downloaded imputed SNP data from the UK Biobank for all remaining individuals and removed SNPs with an information score below 0.8. Information scores for each SNP are provided by the UK Biobank (web resources).

Quality control for the remaining genotyped and imputed 1,933,118 variants was then performed on each cohort separately using the following steps. All structural variants were first removed, leaving only SNPs in the genotype data. Next, all AT/CG SNPs were removed to avoid possible confounding due to sequencing errors. Then, SNPs with a minor-allele frequency less than 1% were removed using the PLINK (version 2.0)38 flag --maf 0.01. We then removed all SNPs found to be out of Hardy-Weinberg equilibrium, using the PLINK --hwe 0.000001 flag to remove all SNPs with a Fisher’s exact test $p < 10^{-6}$. Finally, SNPs with any missingness were removed using the PLINK 2.0 --geno 0.00 flag. This left a total of J = 543,813 SNPs for our study.

Assessing replicability and robustness of the study

To confirm that the marginal epistatic signal identified by SME in the UK Biobank is robust to sample composition, we randomly split the data into two distinct subsets of equal size. This resulted in two cohorts with N= 174,705 individuals and J= 543,813 SNPs. Additionally, we assessed the scale invariance of the signal identified by SME. Here, we reran the analysis for all significant marginal epistatic associations using quantile-normalized versions of each trait. The quantile normalization was performed using the qqnorm function in R.
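For readers working outside R, a rank-based normal transform can be sketched as follows. The plotting positions are chosen to mirror those used by R's qqnorm (via ppoints); that correspondence, the tie handling, and the function name `quantile_normalize` are assumptions of this sketch rather than details stated in the text.

```python
from statistics import NormalDist

def quantile_normalize(values):
    """Map each value to the standard normal quantile of its rank."""
    n = len(values)
    a = 3.0 / 8.0 if n <= 10 else 0.5                  # offset assumed to match ppoints
    order = sorted(range(n), key=lambda i: values[i])   # stable sort; ties keep input order
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - a) / (n + 1 - 2 * a))
    return out
```

The transform depends only on ranks, so any strictly monotone rescaling of the trait yields the same normalized values, which is the property the scale-invariance check relies on.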

GWAS summary statistics

The summary statistics used to compare against marginal epistatic results for each trait in the UK Biobank were downloaded (web resources). These summary statistics were first filtered to match the same set of SNPs that passed our quality control. SNPs that were reported as being associated with a trait at genome-wide significance ($p < 5 \times 10^{-8}$) in the UK Biobank European cohort are highlighted in our analyses.

Generating masks from external data sources

Below is a description of the datasets that we used to generate the masks for SME when analyzing individuals and quantitative traits from the UK Biobank.

Masks using DNase I-hypersensitive sites

The chromatin accessibility-based masks were derived from DNase I-hypersensitive sites (DHSs) measured over 12 days of ex vivo erythroid differentiation.33 The DHS intervals were reported using the hg38 human reference genome. To map correspondence to the UK Biobank data (which use hg19 as reference), we performed a lift over using CrossMap.39 We mapped each SNP in the UK Biobank to the genomic intervals in the DHS data using the R software package GenomicRanges.40 The resulting mask comprised $J^*$ = 4,952 SNPs.

Masks using GWAS summary statistics

Many complex traits have genetic signatures that are not tissue specific, and in many cases, biologically informative annotations from external functional studies may not be available. As a more general strategy, in the absence of trait-specific biological information, we induce sparsity within the SME framework using GWAS summary statistics. The motivation behind this approach is 2-fold. First, variant-level associations from genome-wide studies can serve as proxies for more targeted, biologically informed priors. For example, genomic regions identified by DNase-seq have been shown to be enriched for non-coding variants associated with common diseases and complex traits.27 Second, GWAS summary statistics have been suggested to tag non-additive genetic effects contributing to trait architecture.5 In practice, this results in masks of varying degrees of sparsity and mask sizes (using a genome-wide significance threshold $p < 5 \times 10^{-8}$). In our quality-controlled data from the UK Biobank, $J^*$ = 16,142 SNPs are associated with body height, $J^*$ = 7,778 SNPs are associated with mean corpuscular hemoglobin (MCH), $J^*$ = 3,536 SNPs are associated with uric acid (or urate), and $J^*$ = 547 SNPs are associated with vitamin D levels (VITDs). To match the sparsity observed in DNase I-hypersensitivity data, we select the top 5,000 SNPs ranked by strength of association (lowest p values) for height and MCH and include all significant SNPs for urate and VITDs.
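The selection rule described above (keep genome-wide significant SNPs, capped at the 5,000 strongest associations) can be sketched as follows. The function name `gwas_mask`, its defaults, and the toy p values are hypothetical; real summary statistics would first need to be matched to the quality-controlled SNP set.

```python
def gwas_mask(p_values, threshold=5e-8, max_size=5000):
    """Indices of SNPs kept in the mask: significant, capped at the strongest max_size."""
    significant = [i for i, p in enumerate(p_values) if p < threshold]
    significant.sort(key=lambda i: p_values[i])      # strongest associations first
    return sorted(significant[:max_size])

# Hypothetical toy p values for six SNPs.
toy_p = [1e-12, 0.3, 4e-9, 2e-8, 6e-8, 1e-30]
all_hits = gwas_mask(toy_p)
top_three = gwas_mask(toy_p, max_size=3)
```

When fewer SNPs than the cap pass the threshold, as reported for urate and vitamin D, the cap has no effect and every significant SNP is retained.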

Small note on linkage disequilibrium blocks

To control the type I error rate, variants in the same linkage disequilibrium (LD) block as the j-th focal SNP are also masked. The LD blocks used for this study were approximately independent and derived using European individuals.41
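The LD-based adjustment amounts to one extra masking step per focal SNP. A minimal sketch, assuming each SNP has already been assigned an integer LD block label (the names `mask_ld_block` and `block_ids` are hypothetical):

```python
def mask_ld_block(mask, block_ids, focal):
    """Copy of mask with every SNP in the focal SNP's LD block set to 0."""
    return [0 if block_ids[l] == block_ids[focal] else m
            for l, m in enumerate(mask)]

# Hypothetical toy example: five SNPs in three LD blocks, focal SNP at index 1.
base_mask = [1, 1, 1, 0, 1]
blocks = [0, 0, 1, 1, 2]
focal_mask = mask_ld_block(base_mask, blocks, focal=1)
```

The base mask is shared across focal SNPs, while the block-specific zeros change with each focal SNP.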

Simulation studies

To characterize the behavior of the SME test, we generate quantitative traits using chromosome 1 of the White British cohort from the UK Biobank.2,3,5 These data consisted of N= 349,411 individuals and J= 43,332 SNPs. Here, we sample 10% of the SNPs in the data to be causal and simulate traits using the following linear model:

$$y = \sum_{a \in \mathcal{A}} x_a \beta_a + \sum_{g_1 \in \mathcal{G}_1} \sum_{g_2 \in \mathcal{G}_2} (x_{g_1} \circ x_{g_2})\,\alpha_{g_1 g_2} + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \lambda^2 I),$$ (Equation 8)

where $y$ is an $N$-dimensional phenotype vector; $\mathcal{A}$ represents the set of causal SNPs with additive effects; $x_a$ is the genotype for the $a$-th causal SNP, which has been standardized to have mean zero and variance one across individuals; $\beta_a$ is the additive effect size for the $a$-th SNP; both $\mathcal{G}_1$ and $\mathcal{G}_2$ are sets of epistatic SNPs that are non-overlapping subsets of $\mathcal{A}$; $\alpha_{g_1 g_2}$ is the interaction effect size between $x_{g_1}$ and $x_{g_2}$; and $\varepsilon$ is a vector of normally distributed environmental noise. We sample the effect sizes from standard normal distributions and rescale them so that the additive and epistatic effects explain a desired proportion of the trait variance. Specifically, the additive and epistatic variance components make up the broad-sense heritability of the trait $H^2 = h_A^2 + h_G^2$. Similarly, the environmental noise vector is also rescaled, such that it explains the remaining $1 - H^2$ proportion of the trait variance.

In this simulation design, the epistatic causal SNPs interact between sets. All SNPs in $\mathcal{G}_1$ interact with all SNPs in $\mathcal{G}_2$ but do not interact with variants in their own group (and vice versa). Note that we use this setup because the ability to detect interacting variants in the marginal epistasis framework depends on the proportion of phenotypic variance that they marginally explain. The parameters that determine the PVE by a single SNP are the epistatic heritability $h_G^2$ and the cardinality of the set to which the SNP belongs. For example, an SNP in $\mathcal{G}_1$ will explain, on average, $h_G^2 / |\mathcal{G}_1|$ of the total phenotypic variance.
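The generative model in Equation 8 can be sketched for toy data as follows. The sketch assumes standardized genotypes, non-empty epistatic sets, and rescaling of each component by its sample variance; the function name `simulate_trait` and all argument names are hypothetical.

```python
import numpy as np

def simulate_trait(X, causal, G1, G2, h2_add, h2_epi, rng):
    """Simulate a trait with additive, between-set interaction, and noise components."""
    N = X.shape[0]
    add = X[:, causal] @ rng.standard_normal(len(causal))
    # All pairwise Hadamard products between SNPs in G1 and SNPs in G2 (none within a set).
    inter = np.stack([X[:, a] * X[:, b] for a in G1 for b in G2], axis=1)
    epi = inter @ rng.standard_normal(inter.shape[1])
    noise = rng.standard_normal(N)

    def rescale(v, target):
        # Scale a component so that its sample variance equals the target proportion.
        return v * np.sqrt(target / v.var())

    add = rescale(add, h2_add)
    epi = rescale(epi, h2_epi)
    noise = rescale(noise, 1.0 - h2_add - h2_epi)
    return add + epi + noise, (add, epi, noise)
```

Setting the epistatic proportion to zero reproduces a null scenario with only additive effects, since the interaction component is then scaled to zero.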

To simulate masks, we select some proportion of the non-epistatic SNPs to zero out of the interaction covariance matrix. For example, when analyzing the j-th SNP, 95% masking corresponds to excluding 41,165 out of the 43,332 SNPs when computing Gj. Similarly, only 433 SNPs are used when using a 99% masking strategy. To create masks that induce “uniform sparsity,” we randomly sample from all SNPs in the dataset with uniform probability. In real data, masks are likely not sampled uniformly. An obvious potential source of complication for non-uniform masking can come from LD between SNPs. Therefore, to simulate masks that induce “localized sparsity,” we randomly sample a seed SNP and define a genomic window around it. SNPs outside that window are then masked.
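The two masking schemes can be sketched as below. As a simplification, this sketch samples from all SNPs and does not force the epistatic SNPs to stay unmasked, and it measures the localized window in numbers of adjacent SNPs rather than base pairs; both choices, and all function names, are assumptions. The rounding rule is chosen so that the counts agree with those quoted in the text (2,167 unmasked SNPs at 95% masking, 433 at 99%).

```python
import numpy as np

def n_unmasked(J, frac_masked):
    """Number of SNPs left in the mask after masking the given fraction."""
    return int(round((1.0 - frac_masked) * J))

def uniform_mask(J, frac_masked, rng):
    """Uniform sparsity: unmasked SNPs are sampled with equal probability."""
    mask = np.zeros(J, dtype=int)
    mask[rng.choice(J, size=n_unmasked(J, frac_masked), replace=False)] = 1
    return mask

def localized_mask(J, frac_masked, rng):
    """Localized sparsity: unmasked SNPs form one window around a random seed SNP."""
    width = n_unmasked(J, frac_masked)
    seed = rng.integers(0, J)                           # random seed SNP
    start = min(max(seed - width // 2, 0), J - width)   # keep the window on the chromosome
    mask = np.zeros(J, dtype=int)
    mask[start:start + width] = 1
    return mask
```

Both schemes leave the same number of SNPs unmasked, so differences in calibration between them reflect where the unmasked SNPs sit rather than how many there are.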

Unless otherwise specified, we used the following hyperparameters when applying SME and FAME to the simulated data: B = 100 random vectors for the stochastic trace estimator, block-wise processing of 100 SNPs at a time, and random vectors shared across L = 25 SNPs for the SME implementation. To ensure fair comparisons between methods, all were applied to the same SNPs and synthetic traits within each simulation replicate. Note that causal SNP sets varied across replicates due to random sampling.

Software tools and data sources

Software for running FAME is freely available at https://github.com/sriramlab/FAME. The version of FAME that we implemented for this work (GitHub commit hash prefix cfdd03f; date May 7, 2024) has been archived at https://doi.org/10.5281/zenodo.14607997. The original MAPIT was implemented using the mvMAPIT software package in R and is available both on CRAN (https://cran.r-project.org/package=mvMAPIT) and GitHub (https://github.com/lcrawlab/mvMAPIT). All software for SME, FAME, and MAPIT was run using default settings unless otherwise stated in the main text. Data from the UK Biobank Resource were made available under application number 14649 and can be accessed by direct application to the UK Biobank. GWAS summary statistics were downloaded from https://www.nealelab.is/uk-biobank, and corresponding LD maps were taken from https://bitbucket.org/nygcresearch/ldetect-data/. The DHS data are available at https://doi.org/10.5281/zenodo.5291736. We used CrossMap (https://crossmap.readthedocs.io/) and GenomicRanges (https://doi.org/10.18129/B9.bioc.GenomicRanges) to map SNPs from the UK Biobank to genomic intervals in the DHS data. The chain file for the CrossMap liftover tool can be found at http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/.

Results

SME scales to biobank GWASs

To compare the expected central processing unit (CPU) computation for conducting a biobank-scale genome-wide analysis using SME, FAME,4 and MAPIT,2 we measured the average runtime per SNP for each method on an Intel Xeon Platinum 8268 CPU using a single core (i.e., no parallel processing). Here, we used genotype data from 349,411 individuals of self-identified European ancestry in the UK Biobank with 543,813 SNPs after quality control (material and methods). The memory requirements for MAPIT are prohibitively high, requiring resources on the order of terabytes for biobank-scale datasets with hundreds of thousands of observations. By using Hutchinson’s stochastic trace estimator, both SME and FAME achieve configurable resource requirements and can effectively operate with only a few GB of memory at biobank-scale data.

We find that SME performs genome-wide testing 10× faster than FAME and 90× faster than MAPIT (Figure 2). While analyzing the complete dataset with a single core, SME requires only 3.7 days of runtime compared to FAME and MAPIT, which require 38.4 and 324 days, respectively. The greatest speedup in SME is achieved by approximating the stochastic trace when estimating model parameters (Figure S1). This allows for computations involving large genetic relatedness matrices to be reused across multiple tests for different focal SNPs. Even with the stochastic trace approximations, performing matrix calculations at biobank scale still takes minutes. However, the ability to share these computations across multiple variants significantly reduces the overall computational burden (Figures S2 and S3). The proposed approach enables SME to be effectively applied to the UK Biobank, facilitating GWASs of epistasis. For the real data application in this study, SME effectively computed hundreds of tests simultaneously with less than 85 GB of RAM for a dataset consisting of N= 349,411 individuals.

Figure 2.


Computational time for running SME and other marginal epistatic approaches on biobank-scale data as a function of the genome size

The other methods compared include FAME4 and MAPIT.2 Here, we analyze genotype data from a fixed set of 349,411 individuals from the UK Biobank and vary the genome size. All results were computed on a single core of an Intel Xeon Platinum 8268 central processing unit (CPU). Total runtime was calculated based on the average runtime per SNP and parallel processing on a cluster with 960 CPUs available. Both SME and FAME were set to have the same hyperparameter configurations (e.g., the number of random vectors). SME also used a binary mask that contained 5,000 unmasked SNPs, and its stochastic trace approximation was applied such that sets of 250 focal SNPs shared the same random vectors. MAPIT could not be directly compared due to its excessive memory requirements for datasets of this size. Instead, the runtime for MAPIT was measured on smaller sample sizes (up to 20,000 individuals) and extrapolated to a sample size of 349,411 individuals. This extrapolation assumed quadratic scaling with the number of individuals and linear scaling with the number of SNPs.

SME is a well-calibrated test and conserves type I error rates

We generate synthetic phenotypes using a linear model with real genotypes from chromosome 1 of White British individuals in the UK Biobank.2,3,5 After quality control, we had a dataset of 349,411 individuals and 43,332 SNPs (material and methods). Under the null model, we simulate traits consisting of only additive effects. Here, we randomly sample 10% of the SNPs and scale their effect sizes such that they explain 40% of the total phenotypic variance.
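The null simulation described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the toy binomial genotypes, the standard normal effect sizes, and the function name are hypothetical, whereas the study uses real UK Biobank genotypes. The additive genetic component is rescaled so that it explains the target fraction of phenotypic variance.

```python
import numpy as np

def simulate_additive_trait(X, causal_fraction=0.10, pve=0.40, rng=None):
    """Simulate a trait with additive effects only (the null model for epistasis)."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    n_causal = max(1, int(round(causal_fraction * p)))
    causal = rng.choice(p, size=n_causal, replace=False)
    beta = rng.standard_normal(n_causal)  # assumed effect-size distribution
    g = X[:, causal] @ beta
    g = (g - g.mean()) / g.std() * np.sqrt(pve)        # genetic variance = pve
    e = rng.standard_normal(n)
    e = (e - e.mean()) / e.std() * np.sqrt(1.0 - pve)  # noise variance = 1 - pve
    return g + e, g, causal

rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(2000, 300)).astype(float)  # toy genotypes
y, g, causal = simulate_additive_trait(X, rng=rng)
```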

We simulate external data sources (S) to be used when generating a mask for the marginal epistatic covariance matrix Gj in SME. Recall that these external data sources are intended to give alternative insight into the importance of SNPs and are used to induce sparsity in the modeled gene interactions by dropping interactions with “unimportant” variants. We consider two scenarios in our simulations (Figure S5). In the first, SNPs deemed important in the external data source are sparsely sampled with uniform probability from all variants. As a result, the modeled gene interactions are evenly distributed along the chromosome. We will refer to this scenario as inducing uniform sparsity in the SME model. In the second scenario, we randomly sample one central seed SNP and define variants in a block around it as important. We will refer to this scenario as one that induces localized sparsity. In these simulation experiments, we assess the calibration of SME using both types of external data sources as a function of the percentage of total variants that are masked (varying between 0%, 95%, and 99%) and the number of individuals being analyzed (varying between 20,000, 50,000, 100,000, and 300,000 randomly subsampled individuals).
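The two masking scenarios can be sketched as boolean vectors over SNPs, where True marks a variant that remains available as an interaction partner. This is a hedged Python illustration: the function names are hypothetical, and treating the localized block as a contiguous index window around the seed SNP is an assumption about how "a block around it" is defined.

```python
import numpy as np

def uniform_mask(p, keep_fraction, rng):
    """Keep a uniformly sampled subset of SNPs (uniform sparsity)."""
    keep = np.zeros(p, dtype=bool)
    n_keep = int(round(keep_fraction * p))
    keep[rng.choice(p, size=n_keep, replace=False)] = True
    return keep

def localized_mask(p, keep_fraction, rng):
    """Keep one contiguous block around a random seed SNP (localized sparsity)."""
    n_keep = int(round(keep_fraction * p))
    seed = rng.integers(p)
    # Center the block on the seed, clipped to stay inside the chromosome.
    start = min(max(seed - n_keep // 2, 0), p - n_keep)
    keep = np.zeros(p, dtype=bool)
    keep[start:start + n_keep] = True
    return keep

rng = np.random.default_rng(0)
p = 43_332                                # SNPs on chromosome 1 after quality control
uniform = uniform_mask(p, 0.05, rng)      # 95% masked
local = localized_mask(p, 0.05, rng)      # 95% masked
```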

Under the null model, we find that SME produces well-calibrated p values and unbiased variance component estimates (Figure 3; Table S1) using a uniformly sparse mask. Specifically, higher levels of sparsity lead to more accurate estimates of the marginal epistatic variance components. We also see the precision in its estimates improve as the sample size increases, which is expected since SME uses a normal test to compute p values for each SNP. Overall, this translates to SME preserving empirical type I error rates estimated at significance levels α = 0.05, 0.01, and 0.001, respectively (Table 1). In contrast, FAME produces inflated test statistics as the number of samples in a dataset grows. Note that we do not include a comparison with MAPIT here due to its inability to scale to biobank settings (see Crawford et al.2 for an assessment of its calibration on small-to-moderately sized data).

Figure 3.

While using a mask that induces uniform sparsity, SME is well calibrated under the null hypothesis and does not identify epistasis when traits are generated by only additive effects

Synthetic traits were simulated with only additive effects using chromosome 1 from individuals of self-identified European ancestry in the UK Biobank. These data were then subsampled using sample sizes of 20,000, 50,000, and 100,000 individuals. We randomly selected 10% of all variants to be causal with additive effects, and we assume that they explain 40% of the phenotypic variance for each trait. Data were analyzed using both FAME (as a baseline) and SME under varying percentages of SNPs that are masked (0%, 95%, and 99%, respectively). The small insets in each plot show the distribution of the estimated marginal epistatic variance components across all experiments. For reference, the null hypothesis is H0: σ² = 0. Results are based on 100 simulated traits per scenario.

Table 1.

While using a mask that induces uniform sparsity, SME controls type I error rates when synthetic traits are generated under the null model

Method | Sample size | α = 0.05 | α = 0.01 | α = 0.001
FAME | 20,000 | 0.0531 (0.0214) | 0.0128 (0.0124) | 0.0022 (0.0044)
FAME | 50,000 | 0.0720 (0.0237) | 0.0200 (0.0127) | 0.0053 (0.0078)
FAME | 100,000 | 0.1208 (0.0318) | 0.0526 (0.0223) | 0.0241 (0.0133)
SME (0% masked) | 20,000 | 0.0379 (0.0185) | 0.0056 (0.0074) | 0.0001 (0.0010)
SME (0% masked) | 50,000 | 0.0459 (0.0207) | 0.0084 (0.0090) | 0.0004 (0.0020)
SME (0% masked) | 100,000 | 0.0492 (0.0217) | 0.0085 (0.0090) | 0.0007 (0.0029)
SME (0% masked) | 300,000 | 0.0537 (0.0209) | 0.0090 (0.0090) | 0.0012 (0.0036)
SME (95% masked) | 20,000 | 0.0359 (0.0163) | 0.0046 (0.0061) | 0.0005 (0.0022)
SME (95% masked) | 50,000 | 0.0416 (0.0168) | 0.0060 (0.0075) | 0.0002 (0.0014)
SME (95% masked) | 100,000 | 0.0445 (0.0197) | 0.0078 (0.0080) | 0.0004 (0.0020)
SME (95% masked) | 300,000 | 0.0474 (0.0213) | 0.0073 (0.0081) | 0.0003 (0.0017)
SME (99% masked) | 20,000 | 0.0310 (0.0147) | 0.0029 (0.0052) | 0.0000 (0.0000)
SME (99% masked) | 50,000 | 0.0355 (0.0186) | 0.0037 (0.0065) | 0.0001 (0.0010)
SME (99% masked) | 100,000 | 0.0387 (0.0178) | 0.0042 (0.0062) | 0.0001 (0.0010)
SME (99% masked) | 300,000 | 0.0404 (0.0189) | 0.0059 (0.0070) | 0.0002 (0.0014)

Synthetic traits were simulated with only additive effects using chromosome 1 from individuals of self-identified European ancestry in the UK Biobank. These data were then subsampled using sample sizes of 20,000, 50,000, 100,000, and 300,000 individuals. A total of 100 causal additive variants were randomly selected for each trait, and their effects were assumed to explain 40% of the phenotypic variance. Data were analyzed using both FAME (as a baseline) and SME under varying percentages of SNPs that are masked (0%, 95%, and 99%, respectively). Empirical size for the analyses used significance thresholds of α = 0.05, 0.01, and 0.001. Values in the parentheses are the standard deviations of the estimates. Results are based on 100 simulations per scenario. Due to computational constraints, the data with 300,000 individuals were only analyzed with SME.
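The empirical size reported in the table is the fraction of null p values that fall below each significance level. A minimal Python sketch follows; the function name is hypothetical, and the uniform p values merely stand in for the output of a perfectly calibrated test rather than for any result from the study.

```python
import numpy as np

def empirical_size(p_values, alphas=(0.05, 0.01, 0.001)):
    """Fraction of null p values below each significance level."""
    p_values = np.asarray(p_values, dtype=float)
    return {a: float(np.mean(p_values < a)) for a in alphas}

rng = np.random.default_rng(1)
null_p = rng.uniform(size=100_000)  # stand-in for a calibrated test under the null
size = empirical_size(null_p)
```

A calibrated test yields rates close to each α. Rates well above α, as for FAME at 100,000 individuals, indicate inflation, while rates below α indicate a conservative test.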

Notably, using an external data source with a localized sparse masking scheme introduces a slight negative bias in variance component estimates produced by SME, leading to fewer significant p values and more conservative inference (Figure S6). While type I error control remains conservative (Table S2), this also means that the test may have reduced power when traits are indeed simulated under the alternative with non-zero epistatic effects. We will explore this behavior further in the next section.

The masking strategy in SME leads to improved power in simulations

To assess the power of SME, we again generate synthetic continuous traits using real genotypes from chromosome 1 of White British individuals in the UK Biobank.2,3,5 These data were subsampled using sample sizes of 50,000, 100,000, and 300,000 individuals. Here, we assume that 10% of all SNPs are causal and have additive effects that collectively explain 30% of the trait variance. Next, we fix the epistatic contribution to the trait variance to be 5%, making the total broad-sense heritability 35%. We select a set of epistatic variants from the causal SNPs and divide them into two equally sized groups. Each SNP in one group is simulated such that they only interact with SNPs in the other group. This simulation design gives control over the epistatic PVE by the individual variants. In this analysis, we select 10, 20, 50, and 100 of the causal SNPs to be epistatic, which corresponds to per-SNP epistatic PVE values equal to 1%, 0.5%, 0.2%, and 0.1% of the trait variance.
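The per-SNP epistatic PVE values quoted above follow from the two-group design. The Python sketch below shows one accounting that reproduces them, under two assumptions not spelled out in the text: every cross-group pair receives an equal share of the epistatic variance, and the full variance of each pair is attributed to both SNPs involved. The function name is hypothetical.

```python
def per_snp_epistatic_pve(total_epistatic_pve, n_epistatic_snps):
    """Marginal epistatic PVE per SNP in a two-group interaction design."""
    group_size = n_epistatic_snps // 2   # two equally sized groups
    n_pairs = group_size * group_size    # every cross-group pair interacts
    pve_per_pair = total_epistatic_pve / n_pairs
    return pve_per_pair * group_size     # each SNP takes part in group_size pairs

values = {k: per_snp_epistatic_pve(0.05, k) for k in (10, 20, 50, 100)}
```

Under these assumptions, the per-SNP value simplifies to 2 × 5% divided by the number of epistatic SNPs, which gives 1%, 0.5%, 0.2%, and 0.1% for 10, 20, 50, and 100 SNPs.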

Once again, we analyze SME using two different external data source types that induce uniform and localized sparsity, masking out 0%, 95%, and 99% of the possible interactive partners when constructing the marginal epistatic covariance matrix Gj for each j-th focal SNP being tested (Figure S5A). We compare the empirical power of SME to FAME as a baseline by assessing the respective abilities of both models to identify causal epistatic SNPs at a genome-wide significance threshold (p < 5 × 10⁻⁸).42

We find that using the uniformly sparse masking scheme significantly enhances the power of SME, with greater levels of sparsity leading to better method performance (Figure 4). When analyzing 300,000 individuals, the 99%-masked SME identifies at least 85.1% of the causal epistatic SNPs even when they contribute as little as 0.1% to the trait variance. In comparison, FAME and a non-masked SME detect at most 1% of causal SNPs with very small PVE. When epistatic variants have larger effect sizes and individually account for 1% of the trait variance, the 99%-masked SME shows 99.8% power even with relatively small sample sizes (e.g., 50,000 individuals). Again, FAME and a non-masked SME each have only approximately 35% power in this scenario. To examine the sensitivity of SME to the specification of an external data source, we conducted a simulation in which the model was provided with a weight matrix that incorrectly masked true interacting partners.5 Here, we observed that the SME framework protects against the false discovery of non-additive genetic effects and underestimates the marginal epistatic variance component (σ²) when causal SNPs involved in pairwise interactions were unobserved (Figure S7).
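Empirical power in this setting is the fraction of truly epistatic SNPs whose p value falls below the genome-wide threshold. The following is a minimal Python sketch; the function name and the toy p values are hypothetical and chosen only to show the calculation.

```python
import numpy as np

GENOME_WIDE_THRESHOLD = 5e-8

def empirical_power(p_values, causal_idx, threshold=GENOME_WIDE_THRESHOLD):
    """Fraction of causal epistatic SNPs detected at the given threshold."""
    p_values = np.asarray(p_values, dtype=float)
    return float(np.mean(p_values[np.asarray(causal_idx)] < threshold))

# Toy p values; SNPs 0, 2, and 4 are treated as the causal epistatic variants.
p_vals = np.array([1e-12, 0.3, 4e-9, 0.02, 6e-8, 0.8])
power = empirical_power(p_vals, causal_idx=[0, 2, 4])
```

In the toy example, two of the three causal SNPs pass the threshold, so the empirical power is two-thirds.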

Figure 4.

Uniform sparse modeling of interactions enhances the empirical power of SME

Synthetic traits were simulated with both additive and pairwise epistatic effects using chromosome 1 from individuals of self-identified European ancestry in the UK Biobank. Data were subsampled using sample sizes of 50,000, 100,000, and 300,000 individuals. We randomly selected 10% of all variants to have additive effects that collectively explained 30% of the trait variance. We then fixed the total epistatic variance to 5%. The per-SNP epistatic phenotypic variance explained (PVE) was adjusted by varying the number of interacting SNPs (chosen to be 10, 20, 50, or 100 SNPs). Data were analyzed using both FAME (as a baseline) and SME under varying percentages of variants that are excluded from consideration as potential interaction partners for each focal SNP (0%, 95%, and 99% masking, respectively). Empirical power was determined using the significance threshold p < 5 × 10⁻⁸. Results are based on 100 simulations per scenario, with error bars representing the standard deviation across replicates.

Similar to the null simulation study, we see that SME produces negatively biased variance component estimates when using an external data source that induces localized sparsity in the model. Indeed, overconcentrating the search for potential interacting pairs to a select number of correlated variants leads to reduced empirical power compared to the masking that results in uniform sparsity (Figure S8). For example, when analyzing 300,000 individuals, the localized 99%-masked SME has just 6.5% power to identify epistatic variants that explain 0.1% of the trait variance. To overcome this issue in practice, we propose a strategy in which we take an external data source with localized genomic information and randomly unmask “unimportant” variants with uniform probability along the genome (essentially making the localized sparsity look more uniform, as shown in Figure S5B). As a demonstration of this idea, we implement a version of SME where we include an additional 1% and 5% of initially disregarded interactions back into the construction of Gj for each of the j = 1, …, J focal SNPs tested in the dataset. We find that adding this “noise” back into the mask reduces the negative bias of the variance component estimates and recovers as much as 28% of the power that was lost with respect to the uniformly sparse models (Figure S9). For future users of the SME software, we want to note that there is likely an application-specific trade-off between adding SNPs to a mask to reduce potential bias and finding the degree of sparsity needed for an optimally powered test.
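The unmasking strategy can be sketched as adding uniformly sampled variants back into a boolean mask. This is a hedged Python illustration: the function name is hypothetical, and interpreting the additional 1% or 5% as a fraction of all variants, rather than of the masked ones, is an assumption.

```python
import numpy as np

def add_uniform_noise(keep, extra_fraction, rng):
    """Unmask a uniformly sampled set of previously masked SNPs."""
    keep = np.asarray(keep, dtype=bool).copy()
    masked = np.flatnonzero(~keep)
    # Assumption: extra_fraction is relative to the total number of SNPs.
    n_extra = min(masked.size, int(round(extra_fraction * keep.size)))
    keep[rng.choice(masked, size=n_extra, replace=False)] = True
    return keep

rng = np.random.default_rng(0)
localized = np.zeros(10_000, dtype=bool)
localized[:100] = True                          # toy localized mask, 99% masked
noisy = add_uniform_noise(localized, 0.05, rng)  # add 5% back uniformly
```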

SME uses chromatin information to identify epistasis in hematology traits

We apply SME to four hematology traits—MCH, mean corpuscular hemoglobin concentration (MCHC), mean corpuscular volume (MCV), and hematocrit (HCT)—assayed in 349,411 White British individuals in the UK Biobank19 and genotyped at 543,813 SNPs genome wide. As an external data source, we leverage DHS data measured over 12 days of ex vivo erythroid differentiation.33 Of the quality-controlled SNPs in our data, 4,932 of them are located in DHS regions enriched for transcriptional activity.27 Since previous GWAS results have found genes associated with MCH, MCHC, and MCV to also be implicated in erythroid differentiation,43 we expect that conditioning SME to test over regulatory mechanisms gathered during erythropoiesis will be helpful in identifying epistatic variants for these traits. On the other hand, HCT is a phenotype that measures the percentage of red blood cells in an individual. Since the regulation for this trait has little to do with DHS sites and more to do with oxygen available in the blood,44 we would expect a mask derived from functional data on erythropoiesis to not be helpful in enabling SME to detect epistasis.
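Turning region-level annotations such as DHS intervals into a SNP-level mask amounts to an interval lookup. The sketch below is a minimal Python illustration, assuming sorted, non-overlapping regions on a single chromosome; the function name and the coordinates are hypothetical and do not come from the DHS data used in the study.

```python
import numpy as np

def snps_in_regions(positions, region_starts, region_ends):
    """Flag SNPs whose base-pair position lies inside any [start, end] region.

    Assumes regions are on one chromosome, sorted, and non-overlapping.
    """
    positions = np.asarray(positions)
    starts = np.asarray(region_starts)
    ends = np.asarray(region_ends)
    # Index of the last region starting at or before each position (-1 if none).
    idx = np.searchsorted(starts, positions, side="right") - 1
    has_region = idx >= 0
    inside = np.zeros(positions.shape, dtype=bool)
    inside[has_region] = positions[has_region] <= ends[idx[has_region]]
    return inside

# Hypothetical coordinates: two regions and four SNPs.
mask = snps_in_regions([100, 250, 400, 999], [200, 900], [300, 1000])
```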

For each trait, we use Manhattan plots to visually display the variant-level mapping results, where chromosomes are shown in alternating colors for clarity (Figures 5 and S10–S12). Genes corresponding to SNPs with p values below the genome-wide significance threshold that corrects for multiple testing (p < 5 × 10⁻⁸) are also highlighted. Importantly, many of the marginal epistatic variants identified by SME are supported by multiple published studies that have investigated non-additive gene action related to erythropoiesis and red blood cell traits (Table 2).

Figure 5.

Manhattan plots of a genome-wide interaction analysis using SME to study mean corpuscular hemoglobin assayed in individuals in the UK Biobank

As a mask in this study, we leveraged DNase I-hypersensitive site (DHS) data measured over 12 days of ex vivo erythroid differentiation.27,33 This means that while all SNPs are tested for marginal epistasis, only their interactions with SNPs in DHS regions are considered. Here, −log10-transformed p values from SME are plotted for each SNP against their genomic positions. Chromosomes are shown in alternating colors for clarity. The dashed blue line represents the genome-wide significance threshold (p < 5 × 10⁻⁸). Each image shows the same plot with different aspects of the result highlighted. The first simply shows the names of the closest neighboring genes to significant epistatic SNPs. The second image highlights the SNPs that fall in DHS regions, and the third image highlights SNPs that are also found to have a significant (additive) association with the trait according to a GWAS (material and methods).

Table 2.

SME identifies marginal epistasis in hematology traits from individuals in the UK Biobank

Trait | ID | Coordinates | p value | PVE | p value (SME 0% masked) | p value (MAPIT) | Gene | Reference
MCH | rs4711092 | chr6:25684405 | 1.41 × 10⁻¹¹ | 0.007 | 4.78 × 10⁻⁴ | 0.47 | SCGN | Qin et al.,45 Timoteo et al.,46 Bauer et al.47
MCH | rs9366624 | chr6:25439492 | 1.80 × 10⁻⁹ | 0.011 | 5.75 × 10⁻⁷ | 9.37 × 10⁻³ | CARMIL1 | Ding et al.,43 Ray et al.,48 Timoteo et al.,46 Edwards et al.,49 Yang et al.56
MCH | rs9461167 | chr6:25418571 | 2.34 × 10⁻⁹ | 0.007 | 1.40 × 10⁻⁵ | 3.05 × 10⁻³ | CARMIL1 | Ding et al.,43 Ray et al.,48 Timoteo et al.,46 Edwards et al.,49 Yang et al.50
MCH | rs9379764 | chr6:25414023 | 5.53 × 10⁻⁹ | 0.012 | 4.58 × 10⁻⁶ | 0.21 | CARMIL1 | Ding et al.,43 Ray et al.,48 Timoteo et al.,46 Edwards et al.,49 Yang et al.,50 Vuckovic et al.,51 Zhang et al.52
MCH | rs441460 | chr6:25548288 | 1.20 × 10⁻⁸ | 0.008 | 3.10 × 10⁻⁶ | 0.36 | CARMIL1 | Ding et al.,43 Ray et al.,48 Timoteo et al.,46 Edwards et al.,49 Yang et al.50
MCH | rs198834 | chr6:26114372 | 2.77 × 10⁻⁸ | 0.008 | 2.17 × 10⁻⁶ | 1.83 × 10⁻³ | H2BC4 | Vuckovic et al.,53 Zhang et al.54
MCH | rs13203202 | chr6:25582771 | 3.17 × 10⁻⁸ | 0.012 | 1.90 × 10⁻⁴ | 0.13 | CARMIL1 | Ding et al.,43 Ray et al.,48 Timoteo et al.,46 Edwards et al.,49 Yang et al.50
MCV | rs9276 | chr6:33053577 | 9.09 × 10⁻¹⁰ | 0.002 | 7.27 × 10⁻¹ | 0.18 | HLA-DPB1 |
MCV | rs9366624 | chr6:25439492 | 1.86 × 10⁻⁸ | 0.008 | 2.63 × 10⁻⁷ | 0.16 | CARMIL1 | Ding et al.,43 Ray et al.,48 Timoteo et al.,46 Edwards et al.,49 Yang et al.50

Here, we analyze 349,411 White British individuals in the UK Biobank genotyped at 543,813 SNPs genome-wide. Traits in this analysis included mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular volume (MCV), and hematocrit (HCT). As a mask, we leveraged DNase I-hypersensitive site (DHS) data measured over 12 days of ex vivo erythroid differentiation.27,33 Listed are only results corresponding to SNPs that have marginal epistatic p values below a genome-wide significance threshold to correct for multiple testing (p < 5 × 10⁻⁸). In the second and third columns, we list SNPs and their genomic location in the format chromosome:base pair. Next, we give the p value and marginal epistatic phenotypic variance explained (PVE) for each SNP as estimated by SME. The next two columns report the resulting p values when using SME without an external data source (i.e., 0% masked) and MAPIT. The last columns detail the closest neighboring gene as well as a reference that has previously suggested some level of association or enrichment between each gene and the traits of interest. Due to computational resource constraints, MAPIT was only applied to a random subset of 10,000 individuals.

For example, when analyzing MCH, the strongest association identified by SME is the SNP rs4711092 (p = 1.41 × 10⁻¹¹), which maps to the gene secretagogin (SCGN). SCGN regulates exocytosis by interacting with two soluble N-ethylmaleimide-sensitive fusion attachment proteins (SNAP-25 and SNAP-23) and is critical for cell growth in some tissues.45 For MCH, SME also identified five significantly associated SNPs (e.g., rs9366624 with p = 1.8 × 10⁻⁹) in the gene capping protein regulator and myosin 1 linker 1 (CARMIL1). CARMIL1 is known to interact with and regulate the capping protein (CP), which plays a role via protein-protein interactions in regulating erythropoiesis.48 Specifically, CARMIL proteins regulate actin dynamics by regulating the activity of the CP.55,56 Erythropoiesis leads to modifications in the expression of membrane and cytoskeletal proteins, whose interactions impact cell structure and function.57,58 Both SCGN and CARMIL1 have previously been associated with hemoglobin concentration.43,46 A complete list of the results for all traits is provided in Tables S4–S7. As a baseline for comparison, we also applied SME without an external data source (i.e., 0% masked) and MAPIT to all significant SNPs. The point of this analysis was to explore whether these traditional methods would have also identified the same sets of epistatic variants. Due to computational constraints, MAPIT was only implemented on a random subset of 10,000 individuals. Importantly, neither baseline identified any genome-wide significant associations (Table 2).

Conditioning SME on GWAS variants reveals epistasis in other complex traits

Next, we apply SME using a different external data source to four traits assayed in the same 349,411 White British individuals in the UK Biobank19 and genotyped at the 543,813 SNPs genome wide. These traits include body height, MCH, uric acid (which we refer to as urate), and VITDs. As an external data source, we leverage significant trait associations from GWAS summary statistics (material and methods). Here, we select the top 5,000 SNPs ranked by strength of association (i.e., lowest p values) for height and MCH and include all significant SNPs for urate (3,536 SNPs) and VITDs (547 SNPs). For each trait, we again use Manhattan plots to visually display the variant-level mapping results, with chromosomes shown in alternating colors for clarity (Figures S13–S16). The nearest genes mapped to the lead SNPs of peaks are highlighted.
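The GWAS-based selection rule (all significant SNPs, capped at the 5,000 with the lowest p values) can be sketched as follows. This is a minimal Python illustration; the function name is hypothetical, and the use of 5 × 10⁻⁸ as the GWAS significance cutoff is an assumption, since the text defers the details to the material and methods.

```python
import numpy as np

def gwas_mask(gwas_p, max_snps=5000, threshold=5e-8):
    """Keep significant GWAS SNPs, capped at the max_snps strongest associations."""
    gwas_p = np.asarray(gwas_p, dtype=float)
    significant = np.flatnonzero(gwas_p < threshold)  # assumed cutoff
    if significant.size > max_snps:
        order = np.argsort(gwas_p[significant], kind="stable")
        significant = significant[order[:max_snps]]
    keep = np.zeros(gwas_p.size, dtype=bool)
    keep[significant] = True
    return keep

# Toy p values: four of the six SNPs pass the cutoff.
toy_p = np.array([1e-10, 0.2, 3e-9, 1e-20, 0.04, 4e-8])
capped = gwas_mask(toy_p, max_snps=2)   # more hits than the cap: keep the top 2
uncapped = gwas_mask(toy_p)             # fewer hits than the cap: keep all
```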

Importantly, even when conditioning on GWAS summary statistics, SME identifies significant marginal epistatic variants for both height and urate (see Table S3). For example, when analyzing height, SME finds significant epistasis for the SNP rs9467442 (p = 5.49 × 10⁻⁹), which maps to the gene cytidine monophospho-N-acetylneuraminic acid hydroxylase, pseudogene (CMAHP). Recently, CMAHP has been shown to be associated with body height for populations of European ancestry.59 A complete list of the results for all traits is provided in Tables S8–S11. Again, as a baseline for comparison, we apply SME without an external data source (i.e., 0% masked) and MAPIT to all significant SNPs identified by SME with masking. Due to computational constraints, MAPIT was only applied to a random subset of 10,000 individuals. Neither baseline had enough power to identify any significant associations at the genome-wide threshold (Table S3).

Findings with SME are robust to sample composition and phenotypic scaling

As a final analysis, we assess the replicability and robustness of the marginal epistatic variants identified by SME. First, for all traits with significant marginal epistasis (MCH, MCV, height, and urate), we replicate the application of SME using the respective external data sources (DHS and GWAS) in two independent subsamples of 174,705 White British individuals from the UK Biobank19 genotyped at 543,813 SNPs genome-wide. Across the split-half analyses, the overall results remained consistent, highlighting that the genomic loci selected by SME had stable marginal epistasis associations (Figures S17–S20).
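The split-half design can be sketched as a random partition of individuals into two disjoint sets of equal size. This is a hedged Python illustration; the function name and the use of a random permutation are assumptions about how the subsamples were drawn.

```python
import numpy as np

def split_half(n_individuals, rng):
    """Return two disjoint, equally sized index sets of individuals."""
    perm = rng.permutation(n_individuals)
    half = n_individuals // 2
    # With an odd total, one individual is left out of both halves.
    return perm[:half], perm[half:2 * half]

half_a, half_b = split_half(349_411, np.random.default_rng(0))
```

With 349,411 individuals, each half contains 174,705 individuals, matching the subsample size reported above.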

Next, for all significant marginal epistasis associations, we apply two additional variations of SME and FAME (Table S12). First, we assess the scale invariance of the signal by quantile normalizing the traits. Second, we increase the stringency with which variants are excluded in the mask: in addition to removing variants within the LD block surrounding a focal SNP, we also exclude all variants on the same chromosome. After quantile normalization, most of the marginal epistasis signal remained significant; however, SME lost all power after excluding the entire chromosome where the focal SNP is located. As part of future work, it will be important to further distinguish statistically whether the observed marginal epistatic effects estimated by SME in these traits arise from cis-chromosome interactions or same-locus additive effects.60 Lastly, despite its inflated observed type I error, FAME does not find significant marginal epistasis at any of these loci.
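One common way to quantile normalize a trait is the rank-based inverse normal transform, sketched below in Python. Whether this is the exact transform used in the analysis is an assumption, as are the function name and the (rank − 0.5)/n offset. Because the output depends only on ranks, it is unchanged by any strictly increasing rescaling of the trait.

```python
import numpy as np
from statistics import NormalDist

def inverse_normal_transform(y):
    """Map trait values to standard normal quantiles based on their ranks."""
    y = np.asarray(y, dtype=float)
    # Ranks 1..n; ties are broken by order of appearance.
    ranks = np.argsort(np.argsort(y, kind="stable"), kind="stable") + 1
    quantiles = (ranks - 0.5) / y.size  # strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(q) for q in quantiles])

rng = np.random.default_rng(0)
trait = rng.standard_normal(1001)
normalized = inverse_normal_transform(trait)
```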

Discussion

The marginal epistasis framework is an alternative approach for detecting gene interactions. It derives its power by modeling the combined effect between a focal SNP and all other variants, thus alleviating the need to test every possible interaction separately. Still, current methods seeking to identify marginal epistasis struggle to scale to biobank-scale data and can be underpowered when non-additive genetic effects only explain a small portion of the overall trait variance.2,3 SME overcomes these limitations by inducing sparsity, essentially limiting the combined interaction for a focal SNP to just regions of the genome that have some known functional relationship with the quantitative trait of interest. This approach not only results in more efficient estimators but also offers a mechanism that allows the method to perform genome-wide analyses on modern datasets with runtimes that are one to two orders of magnitude shorter than those of previous approaches. Through extensive simulations, we show that SME controls type I error rates and produces calibrated p values. We also show that SME has the power to detect SNPs involved in epistasis even when they explain very small fractions of the trait variance. By analyzing hematology traits from participants in the UK Biobank, we illustrate that SME, informed by DNase-seq data, identifies statistical epistasis in variants for which previous research has also found interaction pathways. Split-half analyses on distinct subsets of the UK Biobank also show that the non-additive signal in these hematological traits is robust. Lastly, to showcase SME in the absence of biologically informative priors, we illustrate that SME identifies significant marginal epistasis in height and urate with sparsity induced by GWAS summary statistics. We make SME available as an open-source R software package to enable the broader community to easily use it in their research.

The current implementation of SME offers many directions for future development and applications. For example, the key to SME is that it relies on external data sources to induce sparsity in the model. This reliance on biologically informative priors to induce sparsity could serve as a limitation, as appropriate external data may be hard to identify in practice. While simulations show that misspecified or localized sparsity does not jeopardize the ability to control false positive rates, SME currently does not provide instructions on how to best format the external data for a particular analysis. In simulations, we show that some choices can induce a structure that leads to negative bias in the model estimates. We also show that adding random “noise” to the data can reduce this bias. As part of future work, we will explore how to automatically balance this trade-off within the software.

An important consideration when mapping epistasis in real data is that statistically inferred interactions in GWASs may arise from same-locus additive effects.60 Consequently, SME, like any other computational method for epistasis detection, may be confounded by additive effects from untyped or unmodeled variants in the same genomic region. For example, it was found in Hemani et al.13 that an initial set of signals pointing toward evidence of genetic interactions was better explained using linear models of unobserved variants in the same haplotype.5 The analysis of real traits from the UK Biobank presented in this work primarily serves to illustrate potential use cases and demonstrate the scalability of SME. In future work, we hope that SME will contribute to characterizing the role of epistasis in human traits. Such analyses should encompass a broader range of traits across multiple cohorts and incorporate various sparsity-inducing external data sources.

While SME uses an efficient model-fitting algorithm, its current implementation has a non-negligible input/output overhead from repeatedly needing to read (often large) genotype data into memory. Future development that removes this file-reading bottleneck has the potential to further improve the scalability of the method. Lastly, the method currently only models quantitative traits. Future work could extend the advantages of sparse modeling of marginal epistasis to case-control traits.

Data and code availability

Source code and tutorials for implementing SME are publicly available as an R package on CRAN (https://cran.r-project.org/package=smer) and GitHub (https://github.com/lcrawlab/sme). The full list of summary statistics from the genome-wide interaction analysis using SME to study hematology traits in the UK Biobank is publicly available at https://doi.org/10.5281/zenodo.14607997.

Acknowledgments

We thank members of the Weinreich and Crawford labs for insightful comments on earlier versions of this manuscript, as well as Bogdan Pasaniuc (University of Pennsylvania) and Roberta DeVito (Brown University) for helpful discussions. We also thank Ashok Ragavendran (Brown University) and the Computational Biology Core (CBC) for advice on software development. Lastly, we are incredibly grateful to Boyang Fu (UCLA) and Sriram Sankararaman (UCLA) for both technical and conceptual discussions about the implementation of FAME. This research was supported by a David & Lucile Packard Fellowship for Science and Engineering awarded to L.C. This research was conducted in part using computational resources and services at the Center for Computation and Visualization at Brown University. This research was also conducted using the UK Biobank Resource under application number 14649. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funders.

Author contributions

J.S. and L.C. conceived the study. J.S. and L.C. developed the methods. S.P.S. preprocessed and provided the data. J.S. developed the software and performed the analyses. D.W. and L.C. supervised the project and provided resources. All authors wrote and revised the manuscript.

Declaration of interests

L.C. is an employee of Microsoft Research and holds equity in Microsoft. S.P.S. is an employee of and holds equity in Genomics Ltd.

Published: July 29, 2025

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2025.07.004.

Contributor Information

Julian Stamp, Email: julian_stamp@brown.edu.

Lorin Crawford, Email: lcrawford@microsoft.com.

Web resources

Supplemental information

Document S1. Figures S1–S20 and Tables S1–S12
mmc1.pdf (6.5MB, pdf)
Document S2. Article plus supplemental information
mmc2.pdf (13.3MB, pdf)

References


Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics
