Epi-SSA: A novel epistasis detection method based on a multi-objective sparrow search algorithm

Liyan Sun; Jingwen Bian; Yi Xin; Linqing Jiang; Linxuan Zheng

doi:10.1371/journal.pone.0311223

2024 Oct 24;19(10):e0311223. doi: 10.1371/journal.pone.0311223

Liyan Sun ¹,*, Jingwen Bian ², Yi Xin ¹, Linqing Jiang ¹, Linxuan Zheng ¹

Editor: Muhammad Abdul Rehman Rashid³

PMCID: PMC11500897 PMID: 39446852

Abstract

Genome-wide association studies typically consider epistatic interactions a crucial factor in exploring complex diseases. However, current methods concentrate primarily on detecting second-order epistatic interactions, and their accuracy is limited. In this work, we introduce a novel method called Epi-SSA, which is better suited to detecting high-order epistatic interactions. Epi-SSA draws inspiration from the sparrow search algorithm and optimizes the population based on multiple objective functions in each iteration in order to identify epistatic interactions more precisely.

To evaluate its performance, we conducted a comprehensive comparison between Epi-SSA and seven other methods using five simulation datasets: DME 100, DNME 100, DME 1000, DNME 1000 and DNME3 100. The DME 100 dataset encompasses eight second-order epistasis disease models with marginal effects, each comprising 100 simulated data instances, featuring 100 SNPs per instance, alongside 800 case and 800 control samples. The DNME 100 encompasses eight second-order epistasis disease models without marginal effects and retains other properties consistent with DME 100. Experiments on the DME 100 and DNME 100 datasets were designed to evaluate the algorithms’ capacity to detect epistasis across varying disease models. The DME 1000 and DNME 1000 datasets extend the complexity with 1000 SNPs per simulated data instance, while retaining other properties consistent with DME 100 and DNME 100. These experiments aimed to gauge the algorithms’ adaptability in detecting epistasis as the number of SNPs in the data increases. The DNME3 100 dataset introduces a higher level of complexity with six third-order epistasis disease models, otherwise paralleling the structure of DNME 100, serving to test the algorithms’ proficiency in identifying higher-order epistasis. The highest average F-measures achieved by the seven other existing methods on the five datasets are 0.86, 0.86, 0.41, 0.56, and 0.79 respectively, while the average F-measures of Epi-SSA on the five datasets are 0.92, 0.97, 0.79, 0.86, and 0.97 respectively. The experimental results demonstrate that the Epi-SSA algorithm outperforms other methods in a variety of epistasis detection tasks. As the number of SNPs in the data set increases and the order of epistasis rises, the advantages of the Epi-SSA algorithm become increasingly pronounced.

In addition, we applied Epi-SSA to the analysis of the WTCCC dataset, uncovering numerous genes and gene pairs that might play a significant role in the pathogenesis of seven complex diseases. It is worth noting that some of these genes have already been reported in connection with these diseases in the Comparative Toxicogenomics Database (CTD). Epi-SSA is a potent tool for detecting epistatic interactions, which helps us further understand the pathogenesis of common and complex diseases. The source code of Epi-SSA can be obtained at https://osf.io/6sqwj/.

Introduction

Despite the significant progress made in identifying genes related to Mendelian genetic diseases, parsing the genetic basis of non-Mendelian (i.e., complex) diseases faces even more arduous challenges [1–4]. This difficulty mainly stems from the phenomenon of epistasis, which significantly increases the complexity of genetic analysis. The prevailing view is that complex diseases are not caused by a single gene but result from the combined action of variations in multiple genes. These variations have a significant cumulative effect on the disease as a whole, although the impact of each individual variation may be negligible. This cumulative effect is commonly referred to as epistatic interactions or multi-locus interactions [5–8].

With the rise of high-throughput genotyping and sequencing technologies, we are able to explore millions of single nucleotide polymorphisms (SNPs) at the individual level [9–12]. Genome-wide association studies (GWAS), as an emerging strategy, have significantly advanced our understanding of the genetic basis of common and complex diseases. A GWAS covers the genotyping analysis of hundreds of thousands of SNPs in thousands of individuals. In these studies, the detection of epistatic interactions provides a new perspective for disease genetics, helping us to understand these diseases more fully and providing new avenues for their prevention, diagnosis, and treatment [13–16].

In recent years, researchers have proposed a variety of strategies aimed at detecting epistatic interactions present in GWAS.

SNPHarvester [17] screens out SNP clusters significantly associated with the disease through multiple pathways and selects significant SNP clusters through rigorous statistical review. SNPRuler [18] is a method based on predictive rule reasoning to find epistatic interactions related to diseases, and it is the first method to ensure that it can find epistatic interactions without exhaustive search. BOOST [19] is a rapid method for detecting epistatic interactions. This method comprises two main steps: first, it employs a logical operation strategy to preliminarily screen pairs of SNPs; second, it carries out precise statistical analysis on the screened SNP pairs to evaluate their significant association with specific diseases. AntEpiSeeker [20] uses an innovative two-stage ant colony optimization algorithm to identify epistatic interactions within the framework of case-control studies. MACOED [21] as a multi-objective heuristic optimization method integrates logic regression and Bayesian network technology into the ant colony optimization algorithm to enhance the efficiency of the study. FHSA-SED [22] utilizes the Harmony Search Algorithm to identify 2-order epistasis in GWAS data. This method enhances the algorithm’s capacity to detect epistasis by combining both K2 and Gini as the objective functions for optimization. DECMDR [23] combines the differential evolution algorithm and classification-based multi-factor dimension reduction techniques, using CMDR as a fitness measure to explore potential epistatic interactions in GWAS. HS-MMGKG [24] combines harmony optimization algorithms and multiple optimization objectives and uses a novel strategy to combine the p-value and MDR method to increase the accuracy of the detection results. SEE [25] integrates eight evolutionary objectives and uses a new strategy based on sorting, exploration, and utilization to assess the association between SNP combinations and phenotypes. 
DL-GWAS [26] represents a cutting-edge deep-learning framework that employs convolutional neural networks (CNNs) to predict quantitative traits from SNPs in soybean genomes, eliminating the necessity for genotype imputation. This CNN-based model not only achieves higher accuracy but also demonstrates superior efficiency compared to conventional statistical approaches, providing a powerful tool for conducting genome-wide association studies. The multipopulation harmony search algorithm is specifically designed to identify high-order epistasis interactions [27]. This algorithm employs a multipopulation strategy to intensify the exploration of solution spaces, thereby effectively pinpointing intricate SNP interactions that are vital for comprehending the genetic foundation of diseases. Furthermore, it integrates a dynamic search mechanism that adjusts to the complexity of genetic data, rendering it a potent instrument for unearthing multiorder epistasis in biomedical research. SHEIB-AGM [28] is a randomized method based on an automatic gene matrix. In each iteration, it randomly detects epistatic interactions on a higher-order SNP combination according to the content of the gene matrix and then updates the gene matrix according to the detection results to maintain the detection capability of the algorithm. MP-HS-DHSI [27] is a multi-population Harmony Search Algorithm dedicated to the detection of high-order SNP interactions. It uses multiple criteria and multi-harmony memories to discover a set of candidate high-order SNP combinations associated with disease status. DeepCOMBI [29] utilizes CNNs within a deep-learning framework to predict phenotypes from SNPs in the context of GWAS. This innovative method not only achieves superior accuracy in phenotype prediction but also enhances the identification of genetic markers associated with complex traits, all without requiring genotype imputation. 
Furthermore, DeepCOMBI introduces an element of explainability to deep learning by employing layer-wise relevance propagation (LRP), which demystifies the decision-making processes of the CNNs. This transparent approach allows for the precise identification and selection of the most pertinent SNPs, which are then subjected to rigorous statistical testing, thereby enriching the discovery of significant genetic associations. BitEpi [30] is a fast and accurate method to test all possible combinations of up to four bi-allelic variants. It introduces a novel bitwise algorithm which is faster than established software and proposes a novel entropy statistic which is more accurate. EpiMOGA [31] is a multi-objective Genetic Algorithm for epistasis detection. It employs K2 and Gini to guide the search process of the genetic algorithm. A novel algorithm employs a multitasking framework that enhances the traditional Ant Colony Optimization by integrating a sophisticated pheromone update mechanism and local search heuristics, effectively navigating the intricate solution space to identify higher-order SNP interactions with precision [32]. This innovative method stands out for its ability to balance exploration and exploitation, ensuring a comprehensive and efficient search strategy. The paper proposes a novel hybrid algorithm that integrates membrane computing and harmony search for gene selection from expression and methylation data [33]. This unique method leverages the computational prowess of membrane computing in conjunction with the optimization potential of the harmony search algorithm, thereby facilitating the efficient identification of pertinent genes in bioinformatics analysis. SFMOABC [34] is a multi-objective Artificial Bee Colony Algorithm based on the scale-free network. It incorporates the scale-free network into the optimization to guide the update and selection of solutions. 
The Multitasking Harmony Search Algorithm-DHEI (MTHSA-DHEI) [35] is a sophisticated algorithm that utilizes a harmony search framework specifically tailored for multitasking purposes. The primary objective of this algorithm is to proficiently navigate the solution space and pinpoint high-order SNP interactions. The paper proposes a novel method that innovatively employs a harmony search framework with explicit encoding to efficiently identify intricate genetic interactions [36]. This method distinguishes itself by its capacity to manage multiple tasks concurrently, thereby optimizing the search for high-order SNP interactions with precision.

Although many algorithms for detecting epistatic interactions have emerged in recent years, they still have shortcomings in accuracy and efficiency. In this work, we propose a new algorithm, Epi-SSA, which detects higher-order epistatic interactions in GWAS data through the Sparrow Search Algorithm (SSA) [37–40]. Compared to other existing algorithms, Epi-SSA exhibits the following significant advantages:

Epi-SSA adopts an optimization strategy based on multiple objective functions, which can comprehensively evaluate SNP combinations related to the disease state, and enhances the depth of analysis of GWAS data.
Epi-SSA can automatically identify the order of significant epistatic interactions related to the disease state without the need for users to preset the epistasis order, which is more in line with the actual needs of GWAS research.
Epi-SSA generates new individuals based on the SNP weight vector in the iteration and updates the SNP weight vector in each iteration based on the detection results, guiding the evolution direction of the population.
Epi-SSA reduces false positives in the detection of epistatic interactions by applying a new result-filtering strategy.

To verify the detection capability of Epi-SSA, we conducted comparative experiments between Epi-SSA and seven existing algorithms (AntEpiSeeker, DECMDR, HS-MMGKG, SEE, SHEIB-AGM, SNPHarvester, and SNPRuler) on five simulated datasets. These datasets cover 22 types of epistasis models and 3,800 simulated data files. The experimental results show that Epi-SSA outperforms the other algorithms in detection ability on the simulated datasets, especially in the identification of third-order epistasis.

Materials and methods

This work introduces a new algorithm, Epi-SSA, which is based on the idea of the Sparrow Search Algorithm and focuses on identifying epistatic interactions related to diseases in GWAS data. Through repeated iterations, the method guides the sparrow population to evolve in the direction of reducing the values of multiple objective functions, and it obtains disease-related epistatic interactions by examining the top-ranked sparrows in each generation. The overall structure and execution flow of the algorithm are presented in Fig 1, and the following sections describe each procedural step in detail.

Determine the values of mo and ml

The Epi-SSA algorithm uses the maximum epistasis order (mo) [28] to define the upper limit of the order of epistatic interactions related to diseases that it can detect. The setting of this parameter can either be specified by the user according to the research requirements, or automatically calculated based on the number of samples in the GWAS data. The specific calculation method is shown in Eq (1). The purpose of setting mo is to limit the length when the evaluation function processes SNP combinations, and to ensure that throughout the entire operation process of the algorithm, the average number of samples for each genotype combination remains at the level of the natural constant e. This strategy effectively reduces the risk of the evaluation function possibly failing due to processing overly long SNP combinations, thereby enhancing the stability and reliability of the algorithm.

\begin{matrix} mo = \lfloor \ln(\min(m_{0}, m_{1})) - 0.5 \rfloor \end{matrix}

(1)

Where mo represents the maximum epistasis order, m0 is the number of normal samples in the GWAS data, and m1 is the number of disease samples in the GWAS data.

The Epi-SSA algorithm restricts the scale of the contingency table generated during the calculation process of the evaluation function by setting the maximum length of the contingency table (ml). When using the evaluation function to analyze the correlation between SNP combinations and diseases, even for SNP combinations with the same length, due to the possible lack of samples for some genotype combinations, there is a difference in the actual length of the non-zero contingency table. This difference may lead to calculation deviations of the evaluation function on contingency tables of different lengths. Generally, the longer the length of the non-zero contingency table, the more significant the correlation between the SNP combination and the disease it reflects. In order to fairly assess this correlation, the Epi-SSA algorithm introduces a mechanism to control the length of the contingency table in the evaluation process. The setting of this parameter can either be specified by the user according to the research requirements, or automatically calculated based on the number of samples in the GWAS data. The specific calculation method is shown in Eq (2).

\begin{matrix} ml = \lfloor \frac{\min(m_{0}, m_{1})}{10} \rfloor \end{matrix}

(2)

Where ml represents the maximum length of the contingency table, and the definitions of m0 and m1 are consistent with Eq (1).
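As a worked illustration of Eq (1) and Eq (2), the following minimal Python sketch computes both parameters from the sample counts. The function names are hypothetical and are not taken from the Epi-SSA source code.

```python
import math


def max_epistasis_order(m0: int, m1: int) -> int:
    """Eq (1): mo = floor(ln(min(m0, m1)) - 0.5)."""
    return math.floor(math.log(min(m0, m1)) - 0.5)


def max_table_length(m0: int, m1: int) -> int:
    """Eq (2): ml = floor(min(m0, m1) / 10)."""
    return min(m0, m1) // 10


# 800 control and 800 case samples, as in the simulated datasets.
print(max_epistasis_order(800, 800), max_table_length(800, 800))  # prints: 6 80
```

With 800 samples per class, ln(800) is about 6.68, so mo = 6 and ml = 80.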

Initialize the positions of the sparrow population

Randomly generate n vectors with a length of mo, which represent the positions of n sparrows in the population. The position vector of each sparrow is defined according to Eq (3), which elaborately describes the composition of the position vector. During the iteration of the algorithm, the position vectors of the sparrows in the population will undergo a continuous optimization process, which aims to identify the gene epistatic interactions related to the disease.

\begin{matrix} X_{i} = (s_{i,1}, s_{i,2}, \dots, s_{i,j}, \dots, s_{i,mo-1}, s_{i,mo}) \end{matrix}

(3)

Where X_i represents the position vector of the i-th sparrow in the population, and i ∈ [1, n]. Each element s_i,j of the vector X_i is the index of an SNP in the GWAS dataset (that is, it refers to the s_i,j-th SNP), and s_i,j ∈ [1, N], where N represents the total number of SNPs in the GWAS dataset.
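The initialization step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, and drawing distinct SNP indices within each sparrow is an assumption, since the text does not state whether repeated SNPs are allowed.

```python
import random


def init_population(n, mo, num_snps, seed=None):
    """Return n position vectors, each holding mo SNP indices in [1, num_snps].

    Assumption: the SNP indices inside one vector are distinct.
    """
    rng = random.Random(seed)
    return [rng.sample(range(1, num_snps + 1), mo) for _ in range(n)]


population = init_population(n=5, mo=6, num_snps=100, seed=0)
for sparrow in population:
    print(sparrow)
```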

Calculate the three objective functions of the sparrows in the population

While optimizing the population, the Epi-SSA algorithm uses three objective functions to evaluate the position vector of each sparrow: K2, CE, and Gini, which are widely used for detecting epistatic interactions in GWAS data [27, 31, 34]. Their calculation is given in Eq (4). They measure the correlation between the sparrow position vector and the disease from multiple dimensions. The lower the values of these objective functions, the more significant the correlation between the corresponding position vector and the disease.

\begin{matrix} k2(X, Y) = & \prod_{x \in XG} \left( \frac{(|YG| - 1)!}{(m_{x} + |YG| - 1)!} \times \prod_{y \in YG} m_{x,y}! \right) \\ ce(X, Y) = & \sum_{x \in XG, y \in YG} p(x, y) \times \log \frac{p(x, y)}{p(x)} \\ gini(X, Y) = & \sum_{x \in XG} p(x) \times \sum_{y \in YG} p(y | x) \times (1 - p(y | x)) \end{matrix}

(4)

Where X is the position vector of the sparrow, and Y is the disease status of the samples. We use the K2 value (k2(X, Y)), the CE value (ce(X, Y)), and the Gini value (gini(X, Y)) to quantify the correlation between X and Y. XG represents the set of all possible combined genotypes corresponding to X. For instance, for a vector X with a length of 2, XG includes all possible genotype combinations, namely (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2). YG represents the set of sample states; in this work it comprises only two values: 0 for normal samples and 1 for disease samples. m_x is the number of samples in the data that have a specific combined genotype x on the SNPs corresponding to X, and m_x,y is the number of samples that have the combined genotype x and the sample state y. p(x, y) is the ratio of m_x,y to the total number of samples in the data, p(x) is the ratio of m_x to the total number of samples in the data, and p(y|x) is the ratio of m_x,y to m_x, which reflects the conditional probability that the sample state is y given a genotype x.
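To make the counts m_x and m_x,y concrete, the sketch below builds them from per-sample combined genotypes and evaluates the Gini term of Eq (4). It is an illustrative sketch only: the function names and the toy data are assumptions, and the ml-based truncation described later is not applied. K2 and CE can be computed from the same counts.

```python
from collections import Counter


def contingency_counts(genotypes, status):
    """Return (m_x, m_xy) for per-sample combined genotypes and sample states."""
    m_x = Counter(genotypes)
    m_xy = Counter(zip(genotypes, status))
    return m_x, m_xy


def gini(genotypes, status):
    """Gini term of Eq (4): sum_x p(x) * sum_y p(y|x) * (1 - p(y|x)).

    Genotype combinations with no samples have p(x) = 0 and contribute nothing,
    so only the observed combinations are visited.
    """
    m = len(status)
    m_x, m_xy = contingency_counts(genotypes, status)
    states = set(status)
    total = 0.0
    for x, count_x in m_x.items():
        p_x = count_x / m
        impurity = 0.0
        for y in states:
            p_y_given_x = m_xy[(x, y)] / count_x
            impurity += p_y_given_x * (1.0 - p_y_given_x)
        total += p_x * impurity
    return total


# Toy data: two SNPs per sample, four samples.
combos = [(0, 0), (0, 0), (1, 2), (1, 2)]
print(gini(combos, [0, 0, 1, 1]))  # genotype fully determines the status
print(gini(combos, [0, 1, 0, 1]))  # genotype carries no information
```

In the first call every genotype combination maps to a single disease status, so the impurity is zero; in the second call each combination is split evenly between the two states, which is the least informative case for a binary phenotype.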

Although the three objective functions listed in Eq (4) are widely used in the algorithms for detecting epistatic interactions, they have a common limitation: these functions are calculated based on the contingency table between X and Y, and show significant sensitivity to the length of XG. Specifically, as the length of XG increases, the values of these three objective functions tend to decrease. To address this issue, Epi-SSA introduces an optimization strategy that is applicable to these three objective functions, which reduces the dependence on the length of XG by limiting the length of the contingency table to not exceed ml. The specific operation is as follows: Through the analysis of Eq (4), we can find that the value of each objective function is obtained through the cumulative summation of XG, and the smaller the value of the objective function, the stronger the correlation between X and Y. Therefore, during the calculation process, we sort the SNP combined genotypes on the contingency table and only retain the values of the smallest ml − 1 cells. For the remaining cells, we combine their samples to ensure that the maximum length of the contingency table does not exceed ml. This method effectively alleviates the bias of the objective function towards the length of XG and improves the accuracy and applicability of the algorithm.

To integrate the three objective functions as the optimization objectives of the population in the Sparrow Search Algorithm, Epi-SSA adopts a rank-based sorting strategy akin to that used in SEE [25]. The process is illustrated in Fig 2 and consists of the following steps. First, we sort the sparrows in the population by the value of each objective function separately to determine the rankK2, rankCE, and rankGini values of each sparrow. Second, we add the rankK2, rankCE, and rankGini values of each sparrow to obtain its rank sum, rankSum. Finally, we sort the sparrows in the population by rankSum in ascending order. According to the definitions of the three objective functions, a lower rankSum means that, when the three objective functions are considered together, the correlation between the corresponding sparrow's position vector and the disease state is more significant. Through this rank-based sorting method, Epi-SSA can efficiently identify and select the sparrows that are more strongly correlated with the disease state within the multi-objective optimization framework.
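The rank-sum procedure described above can be sketched as follows. The function names are hypothetical, and breaking ties by position in the list is an assumption, because the text does not specify how equal objective values are ranked.

```python
def ranks(values):
    """Rank 1 goes to the smallest value; ties are broken by position (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def rank_sums(k2_vals, ce_vals, gini_vals):
    """rankSum of each sparrow = rankK2 + rankCE + rankGini."""
    return [a + b + c
            for a, b, c in zip(ranks(k2_vals), ranks(ce_vals), ranks(gini_vals))]


def sort_by_rank_sum(k2_vals, ce_vals, gini_vals):
    """Return sparrow indices ordered from the smallest to the largest rankSum."""
    sums = rank_sums(k2_vals, ce_vals, gini_vals)
    return sorted(range(len(sums)), key=lambda i: sums[i])


# Three sparrows with made-up objective values.
k2_vals, ce_vals, gini_vals = [3.0, 1.0, 2.0], [0.5, 0.2, 0.9], [0.4, 0.1, 0.3]
print(rank_sums(k2_vals, ce_vals, gini_vals))         # [8, 3, 7]
print(sort_by_rank_sum(k2_vals, ce_vals, gini_vals))  # [1, 2, 0]
```

Because only ranks are summed, the three objectives contribute equally regardless of their numerical scales.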

Update the positions of the producers

Following the design of the Sparrow Search Algorithm, Epi-SSA divides the population in each iteration into producers and scroungers, with the producers having better fitness than the scroungers. Alg 1 shows the process of updating the positions of the producers. The parameter pd, a decimal between 0 and 1, is the proportion of producers in the population and defaults to 0.4. The parameter n denotes the number of sparrows in the population. The parameter st is the safety threshold, a decimal between 0.5 and 1 with a default value of 0.8. The parameter mo is the maximum epistasis order, which is either specified by the user or calculated according to Eq (1).

Algorithm 1: Update the positions of the producers

Input: pd, n, st, mo.

Output: The position vectors of the sparrow population have been updated.

1 if rand(0,1) < st then

2 for i ← 1 to pd × n do

3 Refer to the position vector of the i-th sparrow in the population as X_i.

4 Randomly select one from all the SNPs in the GWAS data to replace a random position in X_i, resulting in a new position vector of the sparrow, newX.

5 Add newX to the population.

6 Update the rankSum of the population and keep the population ordered.

7 if the rankSum of newX < the rankSum of X_i then

8 remove X_i from the population.

9 else

10 remove newX from the population.

11 end

12 end

13 else

14 for i ← 1 to pd × n do

15 Refer to the position vector of the i-th sparrow in the population as X_i.

16 Randomly select $\frac{mo}{2}$ SNPs from the GWAS data to replace $\frac{mo}{2}$ random positions in X_i, generating a new sparrow position vector, newX.

17 Add newX to the population.

18 Update the rankSum of the population and keep the population ordered.

19 if the rankSum of newX < the rankSum of X_i then

20 remove X_i from the population.

21 else

22 remove newX from the population.

23 end

24 end

25 end
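The producer update of Alg 1 can be sketched in Python as shown below. This is a simplified illustration under stated assumptions, not the authors' implementation: a scalar `score` callable (lower is better) stands in for the rankSum that Epi-SSA recomputes over the whole population, the population is re-sorted once at the end rather than after every insertion, $\frac{mo}{2}$ is rounded down, and a replacement SNP may coincide with one already in the vector. All names are hypothetical.

```python
import random


def update_producers(population, score, num_snps, pd=0.4, st=0.8, rng=None):
    """Simplified producer update.

    population: list of SNP-index lists, ordered best first.
    score: callable returning a fitness value, lower is better
           (assumption: a stand-in for the population-wide rankSum).
    """
    rng = rng or random.Random()
    mo = len(population[0])
    # Below the safety threshold one position changes; otherwise mo/2 positions.
    k = 1 if rng.random() < st else max(1, mo // 2)
    for i in range(int(pd * len(population))):
        new_x = list(population[i])
        for pos in rng.sample(range(mo), k):
            new_x[pos] = rng.randint(1, num_snps)
        # Keep whichever of newX and X_i has the better (lower) score.
        if score(new_x) < score(population[i]):
            population[i] = new_x
    population.sort(key=score)
    return population


# Toy run: the "score" is simply the sum of the SNP indices.
rng = random.Random(0)
pop = sorted(([rng.randint(1, 50) for _ in range(4)] for _ in range(10)), key=sum)
print(update_producers(pop, sum, num_snps=50, rng=rng)[:3])
```

Replacing a producer only when the candidate scores better mirrors the add, re-rank, and remove steps of Alg 1 in a single comparison.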

Update the positions of the scroungers

After the position vectors of the producers have been updated, the Epi-SSA algorithm updates the scroungers in the population. The idea is for the better scroungers to move towards the producers while the poorer scroungers move in a random direction, giving the scroungers a chance to find better positions. The update procedure is given in Alg 2, where the vector pss records the probability of each SNP in the GWAS data being selected; its dimension equals the number of SNPs in the GWAS data. In the startup phase, Epi-SSA initializes pss to a vector consisting entirely of ones. As the algorithm iterates, the elements of pss are adjusted according to the epistatic interactions detected in each iteration. The other parameters are defined as in Alg 1.

Algorithm 2: Update the positions of the scroungers

Input: pd, n, st, mo, pss.

Output: The position vectors of the sparrow population have been updated.

1 for i ← pd × n + 1 to n do

2 Refer to the position vector of the i-th sparrow in the population as X_i.

3 if $i < \frac{n}{2}$ then

4 Randomly select two producers in the population, and mark the producer with the lower rankSum as P.

5 Randomly select $\frac{m o}{2}$ SNPs from both X_i and P respectively to form a new vector with a length of mo and denoted as newX.

6 Add newX to the population.

7 Update the rankSum of the population and keep the population ordered.

8 if the rankSum of newX < the rankSum of X_i then

9 remove X_i from the population.

10 else

11 remove newX from the population.

12 end

13 else

14 From the GWAS data, randomly select mo SNPs according to the probability vector pss to form a new vector, denoted as newX.

15 Add newX to the population.

16 Update the rankSum of the population and keep the population ordered.

17 if the rankSum of newX < the rankSum of X_i then

18 remove X_i from the population.

19 else

20 remove newX from the population.

21 end

22 end

23 end
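The two ways of building newX in Alg 2 can be sketched as below, together with the weight decay that Epi-SSA applies to detected SNPs (factor 0.9, as described in the detection step). These helpers are a hedged illustration: their names are hypothetical, drawing distinct SNPs is an assumption, and for odd mo the split between the scrounger and the producer is an assumption as well.

```python
import random


def weighted_sample(pss, k, rng):
    """Draw k distinct 1-based SNP indices with probability proportional to pss."""
    candidates = list(range(1, len(pss) + 1))
    weights = list(pss)
    chosen = []
    for _ in range(k):
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(pick))
        weights.pop(pick)
    return chosen


def crossover(x_i, producer, rng):
    """Combine mo/2 SNPs from the scrounger with the remainder from the producer.

    Assumption: for odd mo the producer supplies the extra SNP. The result may
    contain a repeated SNP if both parents share one.
    """
    mo = len(x_i)
    half = mo // 2
    return rng.sample(x_i, half) + rng.sample(producer, mo - half)


def decay_weights(pss, detected_snps, factor=0.9):
    """Lower the selection weight of SNPs that were already detected."""
    for snp in detected_snps:
        pss[snp - 1] *= factor


rng = random.Random(1)
pss = [1.0] * 10                      # initial weights: all ones
print(weighted_sample(pss, 4, rng))   # random restart for a poorer scrounger
print(crossover([1, 2, 3, 4], [5, 6, 7, 8], rng))  # move towards a producer
decay_weights(pss, [3])
print(pss[2])
```

The same two helpers also cover the generation of the n × sd new sparrows described in the next subsection, which uses a weighted random draw for the best sparrow and a half-and-half combination otherwise.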

Generate n × sd new sparrows

After the position vectors of the scroungers have been updated, Epi-SSA models the danger-awareness behavior of the Sparrow Search Algorithm: when sparrows at the edge of the group perceive a threat, they move towards the core area of the group, while sparrows in the center of the group explore randomly. In view of the characteristics of GWAS data, the Epi-SSA algorithm simulates this behavior by generating new sparrows. Specifically, the algorithm first randomly selects n × sd sparrows in the population. For each selected sparrow a, if it is the best sparrow in the current population, the algorithm randomly selects mo SNPs from all the SNPs in the GWAS data according to the probabilities stored in the pss vector, forms a new position vector (sparrow), and adds it to the population. If a is not the best sparrow, the algorithm randomly selects a producer b that is better than a, selects half of the SNPs from the position vectors of both a and b to form a new position vector, and adds this new vector (sparrow) to the population.

Detect epistatic interactions in the population

The Epi-SSA algorithm employs a strategy that utilizes the K2 function to detect epistasis of order 2 to mo on mo-order SNP combinations. The core idea can be summarized as follows: Consider an SNP combination X, whose K2 value is calculated by Eq (4) and denoted as k2_X. When an SNP x is removed from X, a new SNP combination R is formed, and its K2 value is denoted as k2_R. If x is an SNP associated with the disease, or if x interacts with other SNPs in X to affect the disease (showing epistasis), then k2_R is greater than k2_X; on the contrary, if x is noise, then k2_R should be less than or equal to k2_X.
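The removal rule described above can be sketched as a greedy loop. This is an illustrative sketch, not the authors' code: the `k2` argument is a caller-supplied scoring function, the order in which SNPs are tried is an assumption, and stopping once a single SNP remains is an assumption as well. The toy score in the example is invented purely to exercise the loop.

```python
def remove_noise_snps(x, k2):
    """Repeatedly drop any SNP whose removal does not increase the K2 value.

    x: list of SNP indices; k2: callable mapping an SNP list to its K2 value.
    """
    current = list(x)
    changed = True
    while changed and len(current) > 1:
        changed = False
        for snp in current:
            reduced = [s for s in current if s != snp]
            # k2_R <= k2_X: the removed SNP is treated as noise.
            if k2(reduced) <= k2(current):
                current = reduced
                changed = True
                break
    return current


# Toy score (hypothetical): causal SNPs lower the value, every SNP adds a small cost.
causal = {3, 7}
toy_k2 = lambda snps: -len(causal & set(snps)) + 0.1 * len(snps)
print(remove_noise_snps([3, 12, 7, 45], toy_k2))  # [3, 7]
```

Because the loop stops only when no single removal keeps K2 from rising, the returned combination can have any order from 1 up to the length of the input, which is how a fixed-length sparrow can yield epistasis of order 2 to mo.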

The Epi-SSA algorithm, based on this concept, takes the following steps to detect epistasis:

Select n × sd optimal sparrows from the population (the top n × sd ranked in the population).
For each sparrow position vector X, repeatedly attempt to remove all noise SNPs in X based on the K2 value.
After the above process, a purified noise-free SNP combination R is obtained. If the length of R is greater than 1, use the G-test (according to Eq (5)) to assess the significance of the association between R and the disease.
If the p-value of the association between R and the disease is less than or equal to the user-defined threshold, record R as one of the results.
To reduce the algorithm’s repeated focus on detected SNPs, in each iteration, update the weight pss[x] of each SNP x in X, with the formula pss[x] updated to pss[x] × 0.9.

\begin{matrix} F(X, Y) = & (|XG| - 1) \times (|YG| - 1) \\ E(x, y) = & m_{x} \times \frac{m_{y}}{m} \\ g(X, Y) = & pvalue_{OfG}\left(2 \times \sum_{x \in XG} \sum_{y \in YG} m_{x,y} \times \ln \frac{m_{x,y}}{E(x, y)}\right) \end{matrix}

(5)

Where g(X, Y) denotes the p-value obtained from the G-test for independence, which is used to evaluate the association between the SNP combination X and the phenotype Y. The significance of the relationship between R and the disease is assessed through g(R, Y). The variable F(X, Y) represents the degrees of freedom associated with the independence test. The count of samples with the SNP combination genotype x is denoted by m_x, and the number of samples exhibiting phenotype y is given by m_y. The total sample size is indicated by m, while E(x, y) is the expected count of samples with genotype x and phenotype y. The function pvalue_OfG calculates the p-value under the chi-square distribution, based on the statistical measures provided. The meanings of the remaining symbols are consistent with previous descriptions.
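For illustration, the sketch below computes the standard G statistic for a test of independence, G = 2 × Σ m_x,y × ln(m_x,y / E(x, y)), together with its degrees of freedom. It is a minimal sketch under stated assumptions: the function name is hypothetical, empty cells are taken to contribute zero, and the number of rows of the table passed in is used as |XG|.

```python
import math


def g_statistic(table):
    """Return (G, degrees of freedom) for a genotype-by-phenotype count table.

    table[x][y] holds the observed count m_{x,y}; rows are combined genotypes
    and columns are phenotype states.
    """
    row_totals = [sum(row) for row in table]          # m_x
    col_totals = [sum(col) for col in zip(*table)]    # m_y
    m = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # convention: empty cells contribute 0
                expected = row_totals[i] * col_totals[j] / m
                g += observed * math.log(observed / expected)
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return 2.0 * g, dof


print(g_statistic([[10, 0], [0, 10]]))  # strong association, 1 degree of freedom
print(g_statistic([[5, 5], [5, 5]]))    # independence gives G = 0
```

The p-value is the upper tail of the chi-square distribution with the returned degrees of freedom; if SciPy is available, `scipy.stats.chi2.sf(g, dof)` computes it.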

Local optimum has been reached

The Epi-SSA algorithm assesses whether the search process has reached a local optimum by analyzing the proportion of distinct SNPs in the population. The specific calculation method is shown in Eq (6). When the algorithm detects a local optimum, it removes the top n × sd sparrows from the population; otherwise, it removes the bottom n × sd sparrows.

\begin{matrix} spasChaos = \frac{numSnps}{n \times mo} \end{matrix}

(6)

Where spasChaos is utilized to assess whether the population has achieved the local optimum state. If spasChaos is less than the user-specified thresholdSpasChaos (with a default value of 0.6), Epi-SSA considers that the population has reached the local optimum state. numSnps refers to the total number of de-duplicated SNPs in the population, n represents the number of sparrows in the population, and mo indicates the maximum epistasis order.
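Eq (6) and the threshold check can be sketched in a few lines. The function names are hypothetical; the default threshold of 0.6 is the value stated above.

```python
def spas_chaos(population):
    """Eq (6): number of distinct SNPs in the population divided by n * mo."""
    n = len(population)
    mo = len(population[0])
    distinct = {snp for sparrow in population for snp in sparrow}
    return len(distinct) / (n * mo)


def at_local_optimum(population, threshold=0.6):
    """The population is considered stuck when spasChaos falls below the threshold."""
    return spas_chaos(population) < threshold


diverse = [[1, 2], [2, 3]]            # 3 distinct SNPs over 4 slots
converged = [[1, 2], [1, 2], [1, 2]]  # 2 distinct SNPs over 6 slots
print(spas_chaos(diverse), at_local_optimum(diverse))      # 0.75 False
print(spas_chaos(converged), at_local_optimum(converged))
```

A population whose sparrows have collapsed onto the same few SNPs yields a small ratio, which triggers the removal of the top-ranked sparrows instead of the bottom-ranked ones.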

Generate the results

To reduce the false positives in the detection results, Epi-SSA proposes a new strategy to filter the results of epistasis detection, aiming to filter out the epistasis with relatively weaker association with the disease as noise. The specific steps are as follows:

Sort all the epistasis in the results based on the significance of the G-test from strong to weak (ascending order of p-value).
Assume that a total of ne epistatic interactions are detected. For each i ∈ [2, ne], calculate the ratio of the significance of the ith epistasis to the significance of the (i − 1)th epistasis, and record the i value corresponding to the largest ratio as iBiggest.
Output the epistasis ranked before iBiggest in the results as the final detected epistasis to the result file, and ignore the epistasis ranked after iBiggest as noise.

By adopting this strategy, Epi-SSA greatly reduces the false positives in the detection results while maintaining the detection accuracy of the algorithm.
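The filtering steps can be sketched as follows. This is a hedged illustration: the function name is hypothetical, "significance" is interpreted as the G-test p-value, all p-values are assumed to be strictly positive so that the ratio is defined, and returning the input unchanged when fewer than two interactions are detected is an assumption.

```python
def filter_results(results):
    """Keep the interactions ranked before the largest jump in p-value.

    results: list of (snp_combination, p_value) pairs with p_value > 0.
    """
    ranked = sorted(results, key=lambda r: r[1])  # most significant first
    if len(ranked) < 2:
        return ranked
    i_biggest, biggest_ratio = 1, -1.0
    for i in range(1, len(ranked)):
        ratio = ranked[i][1] / ranked[i - 1][1]
        if ratio > biggest_ratio:
            i_biggest, biggest_ratio = i, ratio
    return ranked[:i_biggest]


detected = [((1, 2), 1e-30), ((3, 4), 1e-5), ((5, 6), 1e-28), ((7, 8), 1e-4)]
print(filter_results(detected))  # [((1, 2), 1e-30), ((5, 6), 1e-28)]
```

In the example, the largest ratio lies between the second and third ranked p-values (1e-28 to 1e-5), so only the first two interactions are reported and the remaining two are discarded as noise.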

Results and discussion

Experiments on simulated datasets

To evaluate the capability of the Epi-SSA algorithm in epistasis detection, we selected five simulated datasets that together provide a comprehensive assessment of the algorithm. The datasets can be obtained at https://osf.io/6sqwj/. The following is a detailed description of these datasets:

DME 100 dataset: This dataset consists of 8 DME models, each containing 100 simulated GWAS data files. Each file contains 100 SNPs, with 800 case samples and 800 control samples. These models are derived from the DECMDR algorithm, and their penetrance tables can be found in S1 Table.
DNME 100 dataset: This dataset consists of 8 DNME models, each also containing 100 simulated GWAS data files. Each file contains 100 SNPs, with 800 case samples and 800 control samples. The DNME models were generated by the GAMETES [41] software, using minor allele frequency (MAF) values [0.2, 0.4] and heritability values [0.025, 0.05, 0.1, 0.2]. The relevant penetrance tables can be found in S2 Table.
DME 1000 dataset: This dataset is similar to the DME 100 dataset, with the only difference being that the number of SNPs in each GWAS data file is increased to 1000.
DNME 1000 dataset: This dataset is similar to the DNME 100 dataset, with the only difference being that the number of SNPs in each GWAS data file is increased to 1000.
DNME3 100 dataset: This dataset consists of 6 DNME3 models, each containing 100 simulated GWAS data files. Each file contains 100 SNPs, with 800 case samples and 800 control samples. These models were generated by the GAMETES software, using MAF values [0.2, 0.4] and heritability values [0.05, 0.1, 0.2]. The relevant penetrance tables can be found in S3 Table.

In this work, to evaluate the ability of different algorithms to detect epistasis on simulated datasets, we chose F-measure and Power as the performance metrics. These metrics are widely used to assess epistasis detection algorithms on simulated datasets [21, 24, 28, 31], and their formulas are given in Eq (7).

\begin{matrix} Recall = & \frac{TP}{TP + FN} \\ Precision = & \frac{TP}{TP + FP} \\ F\text{-}measure = & \frac{2}{\frac{1}{Recall} + \frac{1}{Precision}} \\ Power = & \frac{S}{100}. \end{matrix}

(7)

The higher the values of F-measure and Power, the better the algorithm performs in identifying epistatic interactions in the simulated dataset. When the algorithm performs epistasis detection on simulated GWAS data files and outputs results, TP represents the number of pathogenic epistatic interactions correctly detected. FN represents the number of pathogenic epistatic interactions that were not correctly detected. FP represents the number of SNP combinations unrelated to the disease that were incorrectly detected. The F-measure, as the harmonic mean of recall and precision, provides a quantitative measure of the overall performance of the algorithm. Specifically, the F-measure of the algorithm on a particular simulated model is determined by calculating the average F-measure of 100 simulated data files under that model. S represents the number of pathogenic epistasis SNP combinations accurately identified by the algorithm in 100 simulated data files.
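To make Eq (7) concrete, the sketch below computes the per-model F-measure and Power from per-file counts. It is illustrative only. It assumes one pathogenic SNP combination per data file, so that S is the number of files with at least one true positive, and it sets the F-measure of a file with no true positives to 0. The function names and the input format are hypothetical.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of recall and precision for one data file."""
    if tp == 0:
        return 0.0                      # convention assumed when nothing correct is found
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2.0 / (1.0 / recall + 1.0 / precision)


def model_scores(per_file_counts):
    """Average F-measure and Power over the data files of one model.

    per_file_counts -- list of (tp, fp, fn) tuples, one per data file
    """
    files = len(per_file_counts)        # 100 files per model in the paper
    mean_f = sum(f_measure(*c) for c in per_file_counts) / files
    s = sum(1 for tp, _, _ in per_file_counts if tp > 0)
    return mean_f, s / files
```

Under these assumptions, a method that finds the true combination in every file but also reports many unrelated combinations keeps a Power close to 1 while its F-measure drops, which is the pattern the two metrics are meant to separate.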

In this work, we conducted an in-depth simulation experiment analysis of a series of algorithms, aiming to evaluate their ability to identify epistatic interactions. The algorithms involved include AntEpiSeeker, DECMDR, HS-MMGKG, SEE, SHEIB-AGM, SNPHarvester, SNPRuler, and Epi-SSA. Table 1 shows the parameters used by these algorithms on different simulated datasets in this paper. To ensure a fair comparison, the population size and the number of iterations are kept consistent when all algorithms are tested on the same dataset. Among them, AntEpiSeeker and SNPHarvester cannot detect 3-order epistasis, hence they cannot be executed on the DNME3 100 dataset.

Table 1. Algorithm parameter settings employed in the experiments on simulated data.

			parameter settings
Algorithm	language	on all datasets	DME 100	DNME 100	DME 1000	DNME 1000	DNME3 100
AntEpiSeeker	C++	alpha = 1 iTopModel = 80 iTopLoci = 16 rou = 0.05 phe = 100 largehapsize = 6 smallhapsize = 3 pvalue = 0.05	iAntCount = 20 iItCountLarge = 25 iItCountSmall = 75 iEpiModel = 2	iAntCount = 20 iItCountLarge = 40 iItCountSmall = 120 iEpiModel = 2	iAntCount = 40 iItCountLarge = 1500 iItCountSmall = 4500 iEpiModel = 2	iAntCount = 40 iItCountLarge = 2000 iItCountSmall = 6000 iEpiModel = 2	Unable to execute
DECMDR	Java	s = 1 m = 0.5 r = 0.5	p = 20 g = 100 o = 2	p = 20 g = 160 o = 2	p = 40 g = 6000 o = 2	p = 40 g = 8000 o = 2	p = 20 g = 4000 o = 3
HS-MMGKG	Java	nsolution = 0 hmcr = 0.8 par = 0.4 fold = 5 pvalue = 0.05	hms = 20 tmax = 100 order = 2	hms = 20 tmax = 160 order = 2	hms = 40 tmax = 6000 order = 2	hms = 40 tmax = 8000 order = 2	hms = 20 tmax = 4000 order = 3
SEE	C++	pe = 0.4 stepInTable = 4 rn = 1 cG = 0.05	numPop = 20 maxIter = 100 order = 2	numPop = 20 maxIter = 160 order = 2	numPop = 40 maxIter = 6000 order = 2	numPop = 40 maxIter = 8000 order = 2	numPop = 20 maxIter = 4000 order = 3
SHEIB-AGM	Java	cG = 0.05 o = −1 pb = 0.8 cGc = 1	maxGen = 2000	maxGen = 3200	maxGen = 240000	maxGen = 320000	maxGen = 80000
SNPHarvester	Java	there is no parameter					Unable to execute
Epi-SSA	Java	pd = 0.4 sd = 0.2 st = 0.8 ml = 0 mo = 0 seed = 0 cG = 0.05 thresholdSpasChaos = 0.6	n = 100 maxG = 20	n = 20 maxG = 160	n = 40 maxG = 6000	n = 40 maxG = 8000	n = 20 maxG = 4000

Fig 3 compares the F-measure of the algorithms on the DME 100 dataset; more detailed results can be found in S4 Table. The average F-measure and standard deviation of these algorithms on the DME 100 dataset are as follows: AntEpiSeeker (0.09, 0.03), DECMDR (0.29, 0.26), HS-MMGKG (0.01, 0.01), SEE (0.05, 0.03), SHEIB-AGM (0.86, 0.09), SNPHarvester (0.67, 0.28), SNPRuler (0.47, 0.23), and Epi-SSA (0.92, 0.03). The comparisons of Power and execution time on the DME 100 dataset are given in S5 and S6 Tables and in S1 and S2 Figs. The average Power and standard deviation of these algorithms on the DME 100 dataset are as follows: AntEpiSeeker (0.72, 0.11), DECMDR (0.29, 0.26), HS-MMGKG (0.07, 0.08), SEE (0.08, 0.07), SHEIB-AGM (0.99, 0.03), SNPHarvester (0.67, 0.28), SNPRuler (0.71, 0.34), and Epi-SSA (0.94, 0.03). These results indicate that Epi-SSA achieves the highest average F-measure on the DME 100 dataset. Although its Power is slightly lower than that of SHEIB-AGM, Epi-SSA attains a higher F-measure, which is attributed to its effectiveness in reducing false positives in the detection results.

Fig 4 compares the F-measure of the algorithms on the DNME 100 dataset; full results are given in S7 Table. The mean F-measure and standard deviation of each algorithm on the DNME 100 dataset are as follows: AntEpiSeeker (0.11, 0.01), DECMDR (0.19, 0.04), HS-MMGKG (0.004, 0.00), SEE (0.02, 0.02), SHEIB-AGM (0.86, 0.09), SNPHarvester (0.73, 0.09), SNPRuler (0.61, 0.12), and Epi-SSA (0.97, 0.03). The comparisons of Power and execution time on the DNME 100 dataset are given in S8 and S9 Tables and in S3 and S4 Figs. The mean Power and standard deviation of each algorithm on the DNME 100 dataset are as follows: AntEpiSeeker (0.88, 0.05), DECMDR (0.19, 0.04), HS-MMGKG (0.02, 0.02), SEE (0.03, 0.02), SHEIB-AGM (0.96, 0.09), SNPHarvester (0.73, 0.17), SNPRuler (0.92, 0.10), and Epi-SSA (0.98, 0.03). Epi-SSA achieves both the highest mean F-measure and the highest mean Power on the DNME 100 dataset. Together with the DME 100 results, this indicates that Epi-SSA retains its advantage across disease models with and without marginal effects.

Fig 5 compares the F-measure of the algorithms on the DME 1000 dataset; further details are provided in S10 Table. The average F-measure and standard deviation of each algorithm on the DME 1000 dataset are as follows: AntEpiSeeker (0.02, 0.04), DECMDR (0.11, 0.24), HS-MMGKG (0.01, 0.02), SEE (0.01, 0.01), SHEIB-AGM (0.06, 0.01), SNPHarvester (0.20, 0.20), SNPRuler (0.41, 0.23), and Epi-SSA (0.79, 0.07). The comparisons of Power and execution time on the DME 1000 dataset are given in S11 and S12 Tables and in S5 and S6 Figs. The mean Power and standard deviation of each algorithm on the DME 1000 dataset are as follows: AntEpiSeeker (0.20, 0.30), DECMDR (0.11, 0.24), HS-MMGKG (0.19, 0.32), SEE (0.10, 0.18), SHEIB-AGM (0.99, 0.02), SNPHarvester (0.20, 0.20), SNPRuler (0.62, 0.35), and Epi-SSA (0.90, 0.05). Epi-SSA achieves the highest average F-measure on the DME 1000 dataset and remains effective when the number of SNPs increases to 1000. SHEIB-AGM reaches a higher Power (0.99) but an F-measure of only 0.06, which shows that Epi-SSA produces far fewer false positives in its detection results.

Fig 6 compares the F-measure of the algorithms on the DNME 1000 dataset; a detailed breakdown is provided in S13 Table. The mean F-measure and standard deviation of each algorithm on the DNME 1000 dataset are as follows: AntEpiSeeker (0.02, 0.01), DECMDR (0.01, 0.01), HS-MMGKG (0.001, 0.00), SEE (0.01, 0.01), SHEIB-AGM (0.06, 0.01), SNPHarvester (0.10, 0.04), SNPRuler (0.56, 0.19), and Epi-SSA (0.86, 0.12). The Power and execution time of these algorithms on the DNME 1000 dataset are detailed in S14 and S15 Tables and in S7 and S8 Figs. The mean Power and standard deviation of each algorithm on the DNME 1000 dataset are as follows: AntEpiSeeker (0.18, 0.15), DECMDR (0.01, 0.01), HS-MMGKG (0.02, 0.01), SEE (0.02, 0.02), SHEIB-AGM (0.97, 0.08), SNPHarvester (0.10, 0.04), SNPRuler (0.84, 0.29), and Epi-SSA (0.95, 0.12). Epi-SSA achieves the highest mean F-measure on the DNME 1000 dataset, and its mean Power is second only to that of SHEIB-AGM.

Fig 7 compares the F-measure of the algorithms on the DNME3 100 dataset; comprehensive results are detailed in S16 Table. The mean F-measure and standard deviation of each algorithm on the DNME3 100 dataset are as follows: DECMDR (0.02, 0.01), HS-MMGKG (0.002, 0.00), SEE (0.01, 0.01), SHEIB-AGM (0.79, 0.05), and Epi-SSA (0.97, 0.04). The comparisons of Power and execution time on the DNME3 100 dataset are given in S17 and S18 Tables and in S9 and S10 Figs. The mean Power and standard deviation of each algorithm on the DNME3 100 dataset are as follows: DECMDR (0.02, 0.01), HS-MMGKG (0.02, 0.02), SEE (0.01, 0.01), SHEIB-AGM (0.99, 0.02), and Epi-SSA (0.95, 0.04). Epi-SSA achieves the highest mean F-measure on the DNME3 100 dataset, with a Power slightly below that of SHEIB-AGM, which indicates that its advantage in F-measure carries over to the detection of 3-order epistatic interactions.

Experiments on real data

We obtained a real dataset from the Wellcome Trust Case Control Consortium (WTCCC) [42], which covers about 14,000 diseased samples for seven common complex diseases: Bipolar Disorder, Coronary Artery Disease, Crohn’s Disease, Hypertension, Rheumatoid Arthritis, Type 1 Diabetes, and Type 2 Diabetes. The dataset also includes a shared control group of about 3,000 samples. The dataset is not publicly available; access can be requested from the owners at https://www.wtccc.org.uk/info/access_to_data_samples.html and https://www.sanger.ac.uk/legal/DAA/MasterController. A detailed description of the dataset is given in S19 Table. We combined the cases of each disease with the shared control group to construct seven GWAS datasets. Following the recommendations of the WTCCC, we then excluded the samples and SNPs that needed to be removed, as well as SNPs that showed no variation across all samples. After these filtering steps, we obtained the seven cleaned GWAS datasets presented in Table 2.

Table 2. The real GWAS data for the seven common complex diseases.

data	disease name	number of SNPs	number of cases	number of controls	number of samples
bd_gwas	Bipolar Disorder	458922	1868	2938	4806
cad_gwas	Coronary Artery Disease	458743	1926	2938	4864
cd_gwas	Crohn’s Disease	459472	1748	2938	4686
ht_gwas	Hypertension	458851	1952	2938	4890
ra_gwas	Rheumatoid Arthritis	458854	1860	2938	4798
t1d_gwas	Type 1 Diabetes	459244	1963	2938	4901
t2d_gwas	Type 2 Diabetes	459112	1924	2938	4862

In this work, we applied the Epi-SSA algorithm to the seven GWAS datasets listed in Table 2 to identify epistatic interactions associated with the seven common complex diseases. We detected a large number of epistatic interactions, some of which are shown in Table 3. Specifically, we found 5,264 epistatic interactions in Bipolar Disorder, 628,817 in Coronary Artery Disease, 3,978 in Crohn’s Disease, 10,013 in Hypertension, 66,642 in Rheumatoid Arthritis, 104,743 in Type 1 Diabetes and 6,334 in Type 2 Diabetes. A detailed list of all detected results is provided in S20 Table.

Table 3. Part of epistatic interactions found by Epi-SSA (n = 600 maxG = 800000).

p-value	SNP1	SNP2	SNP3	SNP4	SNP5	SNP6
Bipolar Disorder
0	rs6599159	rs3845903	rs9368536	rs1909936	rs16909286
0	rs2023974	rs1024592	rs1909936	rs1556811
0	rs6923059	rs1909936	rs797493	rs396395
0	rs1553460	rs10461624	rs1925454	rs1094138
0	rs1553460	rs41478747	rs7718172	rs41323346
0	rs6852266	rs1553460	rs16892342	rs17144728
Coronary Artery Disease
0	rs1541658	rs9804878	rs16957197	rs4149696	rs6643336	rs979357
0	rs6531531	rs793014	rs10074255	rs17063729	rs16926588	rs5928104
0	rs6848027	rs6531531	rs4868979	rs16938648	rs16969155	rs1199460
0	rs6836401	rs41511044	rs159171	rs6059136	rs16981516	rs41369746
0	rs6777905	rs13126272	rs4881411	rs7189731	rs3269	rs1481162
0	rs7653441	rs17129333	rs9288782	rs11921179	rs41478844	rs4799934
Crohn’s Disease
0	rs16869934	rs16888603	rs494483	rs17083420	rs2332903	rs9956765
0	rs6816863	rs6871834	rs494483	rs16895349	rs7787285
0	rs1398832	rs7091562	rs4471699	rs13339951	rs17002802
0	rs1933641	rs1398832	rs7213498	rs10426571	rs17004382
0	rs16825583	rs16856907	rs1553460	rs6460236	rs16906441
0	rs17577123	rs494483	rs16878847	rs302925
Hypertension
0	rs4867173	rs11050927	rs10843660	rs17078208	rs13332100
0	rs1553460	rs4867173	rs11244965	rs359366	rs9927288
0	rs17046143	rs6840033	rs17116117	rs7124582	rs3764220
0	rs345265	rs7628932	rs16837871	rs4867173	rs7720671
0	rs12060579	rs825148	rs2098536	rs1432960	rs4867173
0	rs17465032	rs2766987	rs17116117	rs2373907
Rheumatoid Arthritis
0	rs2298296	rs1369036	rs16863294	rs17032985	rs3129768	rs851236
0	rs41522846	rs3889096	rs692016	rs7176759	rs16942813
0	rs2490225	rs532806	rs507415	rs17104722	rs4829106
0	rs9268402	rs3129934	rs16872017	rs2001097	rs16970572
0	rs2001099	rs6962909	rs41454849	rs2652020	rs5943990
0	rs3135376	rs2395167	rs16907620	rs17104722	rs5909232
Type 1 Diabetes
0	rs6748474	rs10515517	rs3094123	rs2857212	rs4350455	rs6113065
0	rs644045	rs9261376	rs3129933	rs11985334	rs2468600
0	rs17141406	rs2074508	rs1494160	rs2071278	rs692143
0	rs408359	rs9268429	rs9261389	rs2170416	rs11911295
0	rs3177928	rs4144562	rs2517591	rs2395161	rs16925381
0	rs17495612	rs3177928	rs11158	rs2111428	rs931770
Type 2 Diabetes
0	rs1324132	rs6921387	rs17102342	rs6561351	rs7333888
0	rs16891175	rs9480510	rs1324132	rs1477523	rs711295
0	rs16837871	rs16998188	rs1324132	rs10092007	rs505063
0	rs1447910	rs12198368	rs1324132	rs6938374	rs663335
0	rs2062567	rs41441244	rs1590392	rs1324132	rs468453
0	rs17036088	rs1448952	rs16870674	rs1324132	rs5951720

Fig 8 displays the SNP network drawn with Cytoscape software from the detection results for Bipolar Disorder. For readability, only SNP pairs that occur at least 4 times in the results are included. As shown in the figure, SNPs such as rs7653441, rs1909936, rs1553460, and rs6577370 are of significant importance for the study of Bipolar Disorder. The SNP networks for the other six diseases are shown in S11–S16 Figs.

The SNPs from the detection results were mapped to the relevant genes through the dbSNP database [43, 44]. For each of the common complex diseases studied, we counted how often each gene and each gene pair occurs in the detection results. Genes and gene pairs with a higher frequency of occurrence may play a key role in the occurrence and development of the related diseases. To further explore the biological significance of these genes, we searched the Comparative Toxicogenomics Database (CTD) [45]. In the records of the CTD database, DE represents genes with direct evidence supporting their association with specific diseases, NDE refers to genes that are associated with diseases but lack direct evidence, and NF indicates genes for which there are currently no records showing a direct connection with diseases. Some of the gene pair and gene statistics are displayed in Tables 4 and 5, while the complete data are included in S21 and S22 Tables.
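The frequency counting described above can be sketched as follows. This is a hypothetical illustration, not the procedure used to produce Tables 4 and 5: it assumes each detected interaction has already been mapped to a list of gene names (one per SNP that maps to a gene), counts every occurrence of a gene, and counts each unordered gene pair within an interaction once per pair of SNPs. The function name and input format are assumptions.

```python
from collections import Counter
from itertools import combinations


def count_genes_and_pairs(interactions):
    """Count gene and unordered gene-pair occurrences.

    interactions -- list of gene-name lists, one list per detected interaction
    """
    gene_counts = Counter()
    pair_counts = Counter()
    for genes in interactions:
        gene_counts.update(genes)
        # Sorting makes (A, B) and (B, A) the same key; two SNPs in the
        # same gene yield a pair such as (ACSL1, ACSL1).
        for a, b in combinations(sorted(genes), 2):
            pair_counts[(a, b)] += 1
    return gene_counts, pair_counts
```

A network restricted to frequent pairs, such as pairs with at least 4 occurrences, could then be obtained by keeping only the entries of `pair_counts` whose count meets that threshold.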

Table 4. Part of the gene pairs of the epistatic interactions detected on the seven GWAS datasets using Epi-SSA.

Bipolar Disorder
FNDC3B	NDE	LRIG1	NDE	316
H3C11	NDE	H4C13	NDE	15
CENPN	NDE	MYO3B	NDE	8
ACSL1	NDE	ACSL1	NDE	6
A2M	NDE	KLRG1	NDE	6
CSGALNACT1	NDE	GRIN2A	DE	2
Coronary Artery Disease
AIFM1	NDE	RAB33A	NDE	678
PHEX	NDE	PTCHD1-AS	NDE	603
FNDC3B	NDE	MRAS	DE	156
ACSL1	NDE	GUCY1A1	DE	12
ESR1	DE	FRMPD4	NDE	6
DMD	NDE	LDB2	NDE	5
Crohn’s Disease
ATG16L1	DE	LDB2	NDE	18
TBC1D32	NDE	WWC1	NDE	10
IRGM	DE	PTGFRN	NDE	8
IL23R	DE	KCNIP4	NDE	5
CD274	NDE	RRP15	NDE	4
BTG3	NDE	CXADR	NDE	4
Hypertension
SCOC	NDE	SCOC-AS1	NDE	144
CTNNA3	NDE	LRRTM3	NDE	6
CTC-338M12.4	NDE	TRIM52	NDE	5
BCL9L	NDE	CXCR5	NDE	4
HTR3B	NDE	PDE3A	DE	2
CHRM2	NDE	EDN1	DE	2
Rheumatoid Arthritis
BTNL2	NDE	HLA-DPA1	NDE	109
HLA-DQA2	DE	TSBP1	NDE	101
CFB	NDE	NELFE	NDE	52
PON1	DE	TAP2	NDE	5
HLA-DRA	NDE	PTPN22	DE	4
HLA-DPB1	DE	IKZF3	DE	2
Type 1 Diabetes
LY6G6C	NDE	MPIG6B	NDE	760
CDSN	NDE	PSORS1C1	NDE	319
HLA-DQA1	DE	HLA-DQA2	NDE	95
AGPAT1	NDE	TNF	DE	26
GLIS3	DE	TSBP1	NDE	8
BACH2	DE	HCG20	NDE	6
Type 2 Diabetes
CLIC5	NDE	LRIG1	NDE	18
ACSL1	NDE	ACSL1	NDE	16
SBF2	NDE	TCF7L2	DE	3
CSGALNACT1	NDE	GLIS3	DE	3
GCLC	DE	KIAA1671	NDE	2
HTR3B	NDE	ITGA1	DE	2

Table 5. Part of the genes of the epistatic interactions detected on the seven GWAS datasets using Epi-SSA.

Bipolar Disorder
FNDC3B	NDE	1154
LRIG1	NDE	992
CSGALNACT1	NDE	439
NTNG1	DE	4
GRIN2A	DE	3
PDE4B	DE	2
Coronary Artery Disease
FNDC3B	NDE	56444
ACSL1	NDE	28744
DMD	NDE	17273
MRAS	DE	246
PHACTR1	DE	144
LDB2	NDE	90
Crohn’s Disease
KCNIP4	NDE	799
RRP15	NDE	532
LDB2	NDE	492
ATG16L1	DE	99
IL23R	DE	30
IRGM	DE	28
Hypertension
CHRM2	NDE	871
HTR3B	NDE	694
GAN	NDE	243
TGFA	DE	6
EDN1	DE	5
STK39	DE	5
Rheumatoid Arthritis
TSBP1	NDE	9204
BTNL2	NDE	4080
TAP2	NDE	2298
HLA-DPB1	DE	1613
HLA-DQA2	DE	818
PTPN22	DE	8
Type 1 Diabetes
HLA-DQA2	NDE	13335
TSBP1	NDE	10129
AGPAT1	NDE	4716
HLA-DQA1	DE	1519
TNF	DE	71
GLIS3	DE	57
Type 2 Diabetes
CLIC5	NDE	1025
HTR3B	NDE	614
CSGALNACT1	NDE	404
TCF7L2	DE	7
PPARGC1A	DE	6
GLIS3	DE	6

Fig 9 displays the gene network drawn with Cytoscape software [46] from the detection results for Bipolar Disorder. For readability, only gene pairs that occur at least 4 times in the results are included. As shown in the figure, genes such as FNDC3B, LOC107986262, LRIG1, and LOC105375925 are of significant importance for the study of Bipolar Disorder. The gene networks for the other six diseases are shown in S17–S22 Figs.

Conclusion

In this work, we introduce a novel method for detecting epistatic interactions in GWAS data, termed Epi-SSA. This method is designed based on a multi-objective Sparrow Search Algorithm. To evaluate the performance of the Epi-SSA algorithm, extensive experiments were conducted on five simulated datasets generated using GAMETES_2.1. These experiments compared its capabilities in detecting epistasis from various perspectives with other algorithms.

Initially, the experimental results on the DME 100 and DNME 100 datasets demonstrated that the Epi-SSA algorithm has superior detection capabilities when faced with multiple potential pathogenic models. Although occasionally slightly lower in power compared to the SHEIB-AGM algorithm, Epi-SSA significantly reduced the number of false positives in the results. Subsequently, the results from the DME 1000 and DNME 1000 datasets indicated that as the number of SNPs in the GWAS data increases, the detection capability of the Epi-SSA algorithm still holds a significant advantage over other algorithms. Finally, the results from the DNME3 100 dataset showed that Epi-SSA has a considerable advantage in detecting higher-order epistatic interactions compared to other algorithms. After a multitude of simulation experiments, we are confident that Epi-SSA is an extremely useful algorithm for detecting epistatic interactions in GWAS data, particularly adept at detecting higher-order epistasis.

Furthermore, Epi-SSA was utilized to detect epistasis on a real GWAS dataset of seven complex diseases. It detected a significant number of epistatic interactions related to the seven complex diseases in the dataset and constructed SNP and gene networks for the results. We believe these findings are of significant importance for further exploration of these seven complex diseases.

Supporting information

S1 Fig. Power comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DME 100 dataset. (PDF; pone.0311223.s001.pdf, 33.5KB)
S2 Fig. Execution time comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DME 100 dataset. (PDF; pone.0311223.s002.pdf, 32.3KB)
S3 Fig. Power comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DNME 100 dataset. (PDF; pone.0311223.s003.pdf, 32.6KB)
S4 Fig. Execution time comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DNME 100 dataset. (PDF; pone.0311223.s004.pdf, 33KB)
S5 Fig. Power comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DME 1000 dataset. (PDF; pone.0311223.s005.pdf, 32.3KB)
S6 Fig. Execution time comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DME 1000 dataset. (PDF; pone.0311223.s006.pdf, 32.9KB)
S7 Fig. Power comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DNME 1000 dataset. (PDF; pone.0311223.s007.pdf, 32.6KB)
S8 Fig. Execution time comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DNME 1000 dataset. (PDF; pone.0311223.s008.pdf, 33KB)
S9 Fig. Power comparisons between DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B) and Epi-SSA(P) on the DNME3 100 dataset. (PDF; pone.0311223.s009.pdf, 30.4KB)
S10 Fig. Execution time comparisons between DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B) and Epi-SSA(P) on the DNME3 100 dataset. (PDF; pone.0311223.s010.pdf, 31KB)
S11 Fig. The SNP network of the epistatic interactions detected for Coronary Artery Disease (only including SNP pairs with occurrences not less than 60). (PDF; pone.0311223.s011.pdf, 27.5KB)
S12 Fig. The SNP network of the epistatic interactions detected for Crohn’s Disease (only including SNP pairs with occurrences not less than 4). (PDF; pone.0311223.s012.pdf, 28.1KB)
S13 Fig. The SNP network of the epistatic interactions detected for Hypertension (only including SNP pairs with occurrences not less than 4). (PDF; pone.0311223.s013.pdf, 30.9KB)
S14 Fig. The SNP network of the epistatic interactions detected for Rheumatoid Arthritis (only including SNP pairs with occurrences not less than 20). (PDF; pone.0311223.s014.pdf, 34.1KB)
S15 Fig. The SNP network of the epistatic interactions detected for Type 1 Diabetes (only including SNP pairs with occurrences not less than 20). (PDF; pone.0311223.s015.pdf, 42KB)
S16 Fig. The SNP network of the epistatic interactions detected for Type 2 Diabetes (only including SNP pairs with occurrences not less than 4). (PDF; pone.0311223.s016.pdf, 27.8KB)
S17 Fig. The gene network of the epistatic interactions detected for Coronary Artery Disease (only including gene pairs with occurrences not less than 60). (PDF; pone.0311223.s017.pdf, 33.6KB)
S18 Fig. The gene network of the epistatic interactions detected for Crohn’s Disease (only including gene pairs with occurrences not less than 4). (PDF; pone.0311223.s018.pdf, 26.3KB)
S19 Fig. The gene network of the epistatic interactions detected for Hypertension (only including gene pairs with occurrences not less than 4). (PDF; pone.0311223.s019.pdf, 28.8KB)
S20 Fig. The gene network of the epistatic interactions detected for Rheumatoid Arthritis (only including gene pairs with occurrences not less than 20). (PDF; pone.0311223.s020.pdf, 28.3KB)
S21 Fig. The gene network of the epistatic interactions detected for Type 1 Diabetes (only including gene pairs with occurrences not less than 20). (PDF; pone.0311223.s021.pdf, 35KB)
S22 Fig. The gene network of the epistatic interactions detected for Type 2 Diabetes (only including gene pairs with occurrences not less than 4). (PDF; pone.0311223.s022.pdf, 26.3KB)

S1 Table. The penetrance tables for the 8 DME models. (XLSX; pone.0311223.s023.xlsx, 10.2KB)
S2 Table. The penetrance tables for the 8 DNME models. (XLSX; pone.0311223.s024.xlsx, 10.8KB)
S3 Table. The penetrance tables for the 6 DNME3 models. (XLSX; pone.0311223.s025.xlsx, 12.1KB)
S4 Table. F-measure comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DME 100 dataset. (XLSX; pone.0311223.s026.xlsx, 10.4KB)
S5 Table. Power comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DME 100 dataset. (XLSX; pone.0311223.s027.xlsx, 10.2KB)
S6 Table. Execution time comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DME 100 dataset. (XLSX; pone.0311223.s028.xlsx, 10.9KB)
S7 Table. F-measure comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DNME 100 dataset. (XLSX; pone.0311223.s029.xlsx, 10.4KB)
S8 Table. Power comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DNME 100 dataset. (XLSX; pone.0311223.s030.xlsx, 10.7KB)
S9 Table. Execution time comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DNME 100 dataset. (XLSX; pone.0311223.s031.xlsx, 10.4KB)
S10 Table. F-measure comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DME 1000 dataset. (XLSX; pone.0311223.s032.xlsx, 11KB)
S11 Table. Power comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DME 1000 dataset. (XLSX; pone.0311223.s033.xlsx, 10.2KB)
S12 Table. Execution time comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DME 1000 dataset. (XLSX; pone.0311223.s034.xlsx, 11KB)
S13 Table. F-measure comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DNME 1000 dataset. (XLSX; pone.0311223.s035.xlsx, 10.4KB)
S14 Table. Power comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DNME 1000 dataset. (XLSX; pone.0311223.s036.xlsx, 10.3KB)
S15 Table. Execution time comparisons between AntEpiSeeker(A), DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B), SNPHarvester(H), SNPRuler(R) and Epi-SSA(P) on the DNME 1000 dataset. (XLSX; pone.0311223.s037.xlsx, 11KB)
S16 Table. F-measure comparisons between DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B) and Epi-SSA(P) on the DNME3 100 dataset. (XLSX; pone.0311223.s038.xlsx, 10KB)
S17 Table. Power comparisons between DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B) and Epi-SSA(P) on the DNME3 100 dataset. (XLSX; pone.0311223.s039.xlsx, 9.9KB)
S18 Table. Execution time comparisons between DECMDR(D), HS-MMGKG(G), SEE(S), SHEIB-AGM(B) and Epi-SSA(P) on the DNME3 100 dataset. (XLSX; pone.0311223.s040.xlsx, 10.6KB)
S19 Table. The brief description on the WTCCC dataset. (XLSX; pone.0311223.s041.xlsx, 9.8KB)
S20 Table. Epistatic interactions found by Epi-SSA (n = 600 maxG = 800000). (XLSX; pone.0311223.s042.xlsx, 17.5MB)
S21 Table. Gene pairs of the epistatic interactions detected on the seven GWAS data using Epi-SSA. (XLSX; pone.0311223.s043.xlsx, 7.4MB)
S22 Table. Genes of the epistatic interactions detected on the seven GWAS data using Epi-SSA. (XLSX; pone.0311223.s044.xlsx, 800.8KB)

Data Availability

The simulated data and the source code for Epi-SSA can be accessed via the following link: https://osf.io/6sqwj/. We do not have the rights to publicly upload the real data used in our experiments, which is the WTCCC data. However, interested readers can apply for access to the data through the following links: https://www.wtccc.org.uk/info/access_to_data_samples.html and https://www.sanger.ac.uk/legal/DAA/MasterController.

Funding Statement

This work was supported by the Changchun University level scientific research project - Growth Climbing Plan (No.2021JBD05L12) and Jilin Provincial Department of Education Science and Technology Plan project (Nos. JJKH20230676KJ and 2023LY505L22).

References

1. Schork NJ, Murray SS, Frazer KA, Topol EJ. Common vs. rare allele hypotheses for complex diseases. Current Opinion in Genetics & Development. 2009;19: 212–219. doi: 10.1016/j.gde.2009.04.010
2. Gilissen C, Hoischen A, Brunner HG, Veltman JA. Unlocking Mendelian disease using exome sequencing. Genome Biol. 2011;12: 228. doi: 10.1186/gb-2011-12-9-228
3. Hernandez DG, Reed X, Singleton AB. Genetics in Parkinson disease: Mendelian versus non-Mendelian inheritance. J Neurochem. 2016;139: 59–74. doi: 10.1111/jnc.13593
4. van Heyningen V. Mechanisms of non-Mendelian inheritance in genetic disease. Human Molecular Genetics. 2004;13: R225–R233. doi: 10.1093/hmg/ddh254
5. Cordell HJ. Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans. Human Molecular Genetics. 2002;11: 2463–2468. doi: 10.1093/hmg/11.20.2463
6. Phillips PC. Epistasis — the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet. 2008;9: 855–867. doi: 10.1038/nrg2452
7. Mackay TF, Moore JH. Why epistasis is important for tackling complex human disease genetics. Genome Med. 2014;6: 125. doi: 10.1186/gm561
8. Mackay TFC, Anholt RRH. Pleiotropy, epistasis and the genetic architecture of quantitative traits. Nature Reviews Genetics. 2024. doi: 10.1038/s41576-024-00711-3
9. Collins A, Lonjou C, Morton NE. Genetic epidemiology of single-nucleotide polymorphisms. Proceedings of the National Academy of Sciences. 1999;96: 15173–15177. doi: 10.1073/pnas.96.26.15173
10. Schork NJ, Fallin D, Lanchbury JS. Single nucleotide polymorphisms and the future of genetic epidemiology: SNPs and genetic epidemiology. Clinical Genetics. 2000;58: 250–264. doi: 10.1034/j.1399-0004.2000.580402.x
11. Gray IC, Campbell DA, Spurr NK. Single nucleotide polymorphisms as tools in human genetics. Human Molecular Genetics. 2000;9: 2403–2408. doi: 10.1093/hmg/9.16.2403
12. Zhou T, Chen B, Ke Q, Zhao J, Pu F, Wu Y, et al. Development and Evaluation of a High-Throughput Single-Nucleotide Polymorphism Array for Large Yellow Croaker (Larimichthys crocea). Front Genet. 2020;11: 571751. doi: 10.3389/fgene.2020.571751
13. Sud A, Kinnersley B, Houlston RS. Genome-wide association studies of cancer: current insights and future perspectives. Nat Rev Cancer. 2017;17: 692–704. doi: 10.1038/nrc.2017.82
14. Erdmann J, Kessler T, Munoz Venegas L, Schunkert H. A decade of genome-wide association studies for coronary artery disease: the challenges ahead. Cardiovascular Research. 2018. [cited 8 Feb 2022]. doi: 10.1093/cvr/cvy084
15. Uffelmann E, Huang QQ, Munung NS, De Vries J, Okada Y, Martin AR, et al. Genome-wide association studies. Nat Rev Methods Primers. 2021;1: 59. doi: 10.1038/s43586-021-00056-9
16. Naito T, Inoue K, Sonehara K, Baba R, Kodama T, Otagaki Y, et al. Genetic Risk of Primary Aldosteronism and Its Contribution to Hypertension: A Cross-Ancestry Meta-Analysis of Genome-Wide Association Studies. Circulation. 2023;147: 1097–1109. doi: 10.1161/CIRCULATIONAHA.122.062349
17. Yang C, He Z, Wan X, Yang Q, Xue H, Yu W. SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics. 2009;25: 504–511. doi: 10.1093/bioinformatics/btn652
18. Wan X, Yang C, Yang Q, Xue H, Tang NLS, Yu W. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics. 2010;26: 30–37. doi: 10.1093/bioinformatics/btq486
19. Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NLS, et al. BOOST: A Fast Approach to Detecting Gene-Gene Interactions in Genome-wide Case-Control Studies. The American Journal of Human Genetics. 2010;87: 325–340. doi: 10.1016/j.ajhg.2010.07.021
20. Wang Y, Liu X, Robbins K, Rekaya R. AntEpiSeeker: detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm. BMC Res Notes. 2010;3: 117. doi: 10.1186/1756-0500-3-117
21. Jing P-J, Shen H-B. MACOED: a multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies. Bioinformatics. 2015;31: 634–641. doi: 10.1093/bioinformatics/btu702
22. Tuo S, Zhang J, Yuan X, Zhang Y, Liu Z. FHSA-SED: Two-Locus Model Detection for Genome-Wide Association Study with Harmony Search Algorithm. Xue Y, editor. PLoS ONE. 2016;11: e0150669. doi: 10.1371/journal.pone.0150669
23. Yang C-H, Chuang L-Y, Lin Y-D. CMDR based differential evolution identifies the epistatic interaction in genome-wide association studies. Wren J, editor. Bioinformatics. 2017;33: 2354–2362. doi: 10.1093/bioinformatics/btx163
24. Sun L, Liu G, Su L, Wang R. HS-MMGKG: A Fast Multi-objective Harmony Search Algorithm for Two-locus Model Detection in GWAS. CBIO. 2019;14: 749–761. doi: 10.2174/1574893614666190409110843
25. Sun L, Liu G, Su L, Wang R. SEE: a novel multi-objective evolutionary algorithm for identifying SNP epistasis in genome-wide association studies. Biotechnology & Biotechnological Equipment. 2019;33: 529–547. doi: 10.1080/13102818.2019.1593052
26. Liu Y, Wang D, He F, Wang J, Joshi T, Xu D. Phenotype Prediction and Genome-Wide Association Study Using Deep Convolutional Neural Network of Soybean. Front Genet. 2019;10: 1091. doi: 10.3389/fgene.2019.01091
27. Tuo S, Liu H, Chen H. Multipopulation harmony search algorithm for the detection of high-order SNP interactions. Robinson P, editor. Bioinformatics. 2020;36: 4389–4398. doi: 10.1093/bioinformatics/btaa215
28. Sun L, Liu G, Wang R. SHEIB-AGM: A Novel Stochastic Approach for Detecting High-Order Epistatic Interactions Using Bioinformation With Automatic Gene Matrix in Genome-Wide Association Studies. IEEE Access. 2020;8: 21676–21693. doi: 10.1109/ACCESS.2020.2969465
29. Mieth B, Rozier A, Rodriguez JA, Höhne MMC, Görnitz N, Müller K-R. DeepCOMBI: explainable artificial intelligence for the analysis and discovery in genome-wide association studies. NAR Genomics and Bioinformatics. 2021;3: lqab065. doi: 10.1093/nargab/lqab065
30. Bayat A, Hosking B, Jain Y, Hosking C, Kodikara M, Reti D, et al. Fast and accurate exhaustive higher-order epistasis search with BitEpi. Sci Rep. 2021;11: 15923. doi: 10.1038/s41598-021-94959-y
31. Chen Y, Xu F, Pian C, Xu M, Kong L, Fang J, et al. EpiMOGA: An Epistasis Detection Method Based on a Multi-Objective Genetic Algorithm. Genes. 2021;12: 191. doi: 10.3390/genes12020191
32. Tuo S, Li C, Liu F, Zhu Y, Chen T, Feng Z, et al. A Novel Multitasking Ant Colony Optimization Method for Detecting Multiorder SNP Interactions. Interdisciplinary Sciences: Computational Life Sciences. 2022;14: 814–832.
33. Tuo S, Liu F, Feng Z, Li C, Zhu Y, Chen T, et al. Membrane computing with harmony search algorithm for gene selection from expression and methylation data. Journal of Membrane Computing. 2022;4: 293–313. doi: 10.1007/s41965-022-00111-8
34. Gu Y, Sun Y, Shang J, Li F, Guan B, Liu J-X. Multi-Objective Artificial Bee Colony Algorithm Based on Scale-Free Network for Epistasis Detection. Genes. 2022;13: 871. doi: 10.3390/genes13050871
35. Tuo S, Li C, Liu F, Li A, He L, Geem ZW, et al. MTHSA-DHEI: multitasking harmony search algorithm for detecting high-order SNP epistatic interactions. Complex Intell Syst. 2023;9: 637–658. doi: 10.1007/s40747-022-00813-7
36. Tuo S, Jiang J. A Novel Detection Method for High-Order SNP Epistatic Interactions Based on Explicit-Encoding-Based Multitasking Harmony Search. Interdisciplinary Sciences: Computational Life Sciences. 2024.
37. Xue J, Shen B. A novel swarm intelligence optimization approach: sparrow search algorithm. Systems Science & Control Engineering. 2020;8: 22–34. doi: 10.1080/21642583.2019.1708830
38. Li B, Wang H. Multi-objective sparrow search algorithm: A novel algorithm for solving complex multi-objective optimisation problems. Expert Systems with Applications. 2022;210: 118414. doi: 10.1016/j.eswa.2022.118414
39. Gharehchopogh FS, Namazi M, Ebrahimi L, Abdollahzadeh B. Advances in Sparrow Search Algorithm: A Comprehensive Survey. Arch Computat Methods Eng. 2023;30: 427–455. doi: 10.1007/s11831-022-09804-w
40. Awadallah MA, Al-Betar MA, Doush IA, Makhadmeh SN, Al-Naymat G. Recent Versions and Applications of Sparrow Search Algorithm. Archives of Computational Methods in Engineering. 2023;30: 2831–2858. doi: 10.1007/s11831-023-09887-z
41. Urbanowicz RJ, Kiralis J, Sinnott-Armstrong NA, Heberling T, Fisher JM, Moore JH. GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData Mining. 2012;5: 16. doi: 10.1186/1756-0381-5-16
42. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447: 661–678. doi: 10.1038/nature05911
43. Sherry ST. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research. 2001;29: 308–311. doi: 10.1093/nar/29.1.308
44. Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research. 2012;40: D13–D25. doi: 10.1093/nar/gkr1184
45. Mattingly CJ, Colby GT, Forrest JN, Boyer JL. The Comparative Toxicogenomics Database (CTD). Environmental Health Perspectives. 2003;111: 793–795. doi: 10.1289/ehp.6028
46. Kohl M, Wiese S, Warscheid B. Cytoscape: Software for Visualization and Analysis of Biological Networks. In: Hamacher M, Eisenacher M, Stephan C, editors. Data Mining in Proteomics. Totowa, NJ: Humana Press; 2011. pp. 291–303.
