Abstract
Detecting high-order epistasis in genome-wide association studies (GWASs) is of importance when characterizing complex human diseases. However, the enormous numbers of possible single-nucleotide polymorphism (SNP) combinations and the diversity among diseases presents a significant computational challenge. Herein, a fast method for detecting high-order epistasis based on an interaction weight (FDHE-IW) method is evaluated in the detection of SNP combinations associated with disease. First, the symmetrical uncertainty (SU) value for each SNP is calculated. Then, the top-k SNPs are isolated as guiders to identify 2-way SNP combinations with significant interaction weight values. Next, a forward search is employed to detect high-order SNP combinations with significant interaction weight values as candidates. Finally, the findings were statistically evaluated using a G-test to isolate true positives. The developed algorithm was used to evaluate 12 simulated datasets and an age-related macular degeneration (AMD) dataset and was shown to perform robustly in the detection of some high-order disease-causing models.
Keywords: Single-nucleotide polymorphism, high-order epistasis, interaction weight
1. Introduction
In recent years, genome-wide association studies (GWASs) have played an important role in identifying single-nucleotide polymorphisms (SNPs) associated with complex human diseases. This approach is non-candidate-driven, and investigates the entire genome, thus offering a more comprehensive method when compared to gene-specific candidate-driven studies [1]. Genome-wide association studies was first employed by Klein et al. to investigate patients with age-related macular degeneration (AMD), and they identified two SNPs (rs380390 and rs10272438) with significant AMD associations [2]. Since its conception, 1800 diseases and traits, and thousands of associated SNPs have been identified, with the main focus being on individual SNPs that are isolated based on their contribution to disease status [3]. However, more recently, it has become widely accepted that multiple SNPs may contribute to a given pathogenicity via epistasis. Epistasis is defined as the effect of one gene (allele) on a phenotype that is modified by another gene (allele) or several other genes (alleles) [4], and includes additive/synergistic epistasis, dominant epistasis, recessive epistasis, functional epistasis, and sign epistasis [5].
Nevertheless, detecting and identifying the disease-causing SNP combinations on a genome-wide scale is met with several challenges. First, there is an enormous computational burden that is associated with the examination of SNP combinations due to multiple testing. Second, it is challenging to develop a method that is able to reliably identify disease-causing SNP combinations from those that are not given the diversity that exists among disease models [6], especially when there is insufficient sample data.
To tackle these challenges, some algorithms were developed to detect synergistic SNP combinations associated with complex diseases. The majority of these methods can be classified into three categories: exhaustive methods [7,8,9,10,11], filtering methods (SNPHarvester) [12,13], or artificial intelligence (including swarm intelligence and heuristic search methods) [14,15,16,17,18,19,20,21,22].
In the exhaustive method, all SNP combinations from the original data set are verified, which ensures that disease-causing SNP combinations are seldomly missed, but it bears an enormous computation burden. To speed up the calculations, some lightweight scoring methods are employed. However, they usually prefer a small part of a disease model. The classical exhaustive methods include Boolean operation-based screening and testing (BOOST) [7], graphical processing unit (GPU)-implementation of BOOST (GBOOST) [8], GPU-based tools for parallel permutation tests in GWASs (PBOOST) [9], and epistasis analysis based on multi-objective optimization (ESMO) [10]. Boolean operation-based screening and testing utilizes a Boolean operation to speed up the examination of pairwise SNP interactions using an exhaustive search approach. GBOOST and PBOOST further accelerate detection by employing GPUs. Epistasis multi-objective optimization utilizes exhaustive methods to evaluate all SNP combinations using mutual entropy and a Bayesian network. It is not feasible for high-order epistasis detection, due to the enormous computational burden.
The filtering method applies data-driven and biological knowledge filters. The data-driven filter reduces SNP numbers by calculating the correlation between individual SNPs and disease status. Single-nucleotide polymorphisms below a given threshold are filtered out, and some disease-causing SNPs with a very low effect may potentially also be removed. The biological knowledge filter groups genes in the light of their biological functions, and detects the epistatic interactions within a group, but introduces a bias towards well-known genes, and against less well-characterized genes [3,12,13].
Artificial intelligent algorithms, such as bayesian epistasis association mapping (BEAM) [14], Ant colony optimization based epistatic interaction (AntEpiSeeker) [15], Cuckoo search epitasis (CSE) [16], multi-objective ant colony optimization epistasis detection (MACOED) [17], fast harmony search algorithm based SNP epistasis detection (FHSA-SED) [18], niche harmony search algorithm based high-order SNP combination detection (NHSA-DHSC) [19], Co-Information basedN-Order epistasis detector and visualizer (CINOEDV) [20], and high-order interaction seeker(HiSeeker) [22] have attracted attention when detecting high-order epistatic interactions, due to a reduced computational burden, which is due to not all SNP combinations being examined. However, these algorithms are often sensitive to parameters, and easily trapped in local searches [23,24].
Recently, information theory has focused on selection from high-dimensional data sets [25,26,27]. In this study, an approach termed a fast approach for detecting high-order epistasis with interaction weight (FDHE-IW) was developed. In this approach, k-way epistasis interactions are detected, with the detection process divided into two stages: searching and testing. During the searching stage, an improved method is utilized to calculate the symmetrical uncertainty (SU) of each SNP locus with a given phenotype. Next, several SNPs with high SU values are chosen as guiders to isolate 2-way, 3-way, and k-way SNP combinations that are significantly associated with disease status. These SNP combinations are recorded as candidate solutions. During the testing stage, the G-test statistical method was employed to test the association significance of the candidate solutions. An example schematic for detecting 3-way SNP combinations can be seen in Figure 1.
2. Materials and Methods
Sets of SNP variables are defined by with N SNP loci, where Xi is the genotype variable of the ith SNP locus with values of , accounting for the homozygous major allele (0), the heterozygous allele (1) and the homozygous minor allele (2). C denotes the phenotype variable with values of ( for a given disease phenotype). For a k-way SNP combination (such as a 3-way SNP combination ), I denotes the number of genotype combinations (I = 3k for k-way SNP combination), J is the number of phenotype states (C), with J = 2 for a given disease phenotype (it equals 2 for a case-control dataset. 1 denotes case label, 0 is the control label). The number of samples in a dataset is defined as ni, with the SNP loci taking the value of the ith genotype combination (1 ≤ i ≤ I), and nij representing the number of samples of the ith genotype combination being associated with the phenotype state (cj).
2.1. Definitions
“Information entropy” has been defined as the average amount of information that is produced by a stochastic source of data that can be used to measure the data distribution diversity, and to measure the uncertainty of random variables [28]. To introduce our method well, the terms were defined:
Entropy: Let be the probability of the ith genotype of a SNP variable (X), with the entropy of X being expressed as:
(1) |
Joint entropy: Let be k SNP variables, with being a particular genotype value for these SNP loci, and being the probability of the genotype occurring together. The joint entropy (JE) of multiple variables is defined as:
(2) |
The JE is a measure of the uncertainty that is associated with a set of variables and can be used to measure the genotype distribution of a k-way SNP combination ; however, it cannot be used in assessing genotype–phenotype correlations. Recently, mutual information has attracted extensive attention for identifying the association.
Mutual information: To measure the mutual dependence between a SNP locus (X) and phenotype (C), the mutual information was defined as follows:
(3) |
where denotes the marginal probability distribution function of X.
Joint mutual information: The joint mutual information between k variables (X1, X2, Xk) and the phenotype (C) was defined as follows:
(4) |
The mutual information can also be regarded as a measure of the interaction strength between two SNPs (X and C), or between a SNP (X) and the disease status (C).
Interaction gain: Mutual information can also be called an interaction gain (IG) between X and Y. Based on a previous proposed interaction gain with three variables X, Y, and C [29], the following was derived:
(5) |
In GWASs, X and Y can be regarded as SNP loci, and C can be either a SNP variable or phenotype variable. In IG(X; Y; C), a 2-way interaction between SNP X and SNP Y for phenotype C can be depicted, or a 3-way interaction between X, Y, and C can be indicated. If X, Y, and C are a feature set, such as in , can be regarded as an IG between X and Y.
In Equation (5), if then X and Y together yield a more synergistic effect on phenotype than would be expected from their sum individually. Thus, X and Y interacting with each other (interaction effect), and the presence of Y will increase the ability of predicting the phenotype (C). Conversely, indicates a redundancy between X and Y.
In GWAS, measuring the interaction between genotype variables (X, Y) and phenotype C is very important but difficult. In this study, a new interaction weight (IW) factor was employed.
Interaction weight factor: The interaction weight factor (IWF) between X and Y has been previously [30] defined as:
(6) |
IWF(X, Y) has the following properties:
-
(1)
0 ≤ IWF(X, Y) ≤ 2
-
(2)
1 ≤ IWF(X, Y) ≤ 2 if X interacts with Y.
-
(3)
0 ≤ IWF(X, Y) ≤ 1 if X is redundant to Y.
Evaluating the association of all SNP combinations with specific phenotypes may be time-consuming, and the computational evaluation of high-order SNP combinations is generally difficult to perform. To speed up the process of detecting epistasis from high-dimensional data sets, the present study employed Symmetrical uncertainty in identifying search seeds.
Symmetrical uncertainty (SU): Mutual information has been widely adopted for data mining from high-dimensional data. However, it tends to favor features with more values while ignoring interactive features [31]. In this study, SU was utilized to compensate for the bias of mutual information toward features with more values, as previously described [27,30], and this was defined as Equation (7):
(7) |
In Equation (7), I(X;C) is used to measure the mutual dependence between the SNP locus (X) and the disease status (C). H(X) and H(C) measure the diversity of the SNP genotype distribution and the phenotype, respectively. Nevertheless, SU cannot effectively measure low associations. To enhance the ability to detect lower SNP and phenotype associations, the SU equation was modified as follows:
(8) |
In Equation (8), H(X) + H(C) is replaced with H(X, C) because we have found that the joint distribution of variables X* and C (let X* be the disease locus) usually have a smaller degree of dispersion than those of the other variables, namely, X and C. This improved Equation (8) thus enables the identification of some susceptibility loci with low marginal effects. This proposed algorithm is further described in Algorithm 1.
Algorithm 1: FDHE-IW |
Inputs: D (s1, s2, …, sN,
C)—the given data set with N + 1 columns; si denotes the values of the ith SNP locus for all samples. T—the candidate size; θ—the threshold of the G-test p-value; k—the number of SNPs in a k-way SNP combination; and K—the number to find the SNP combinations based on a seed SNP. Outputs: SNP combinations (SC)—the k-way SNP combinations that are associated with disease status. |
|
|
The FDHE-IW algorithm first calculates the SU value for each SNP (Step (2)). Then the SNP with a maximum SU value is chosen (see step (3.1)) to find K k-way SNP combinations that are associated with the status of the diseases. In step (3.2) of the FDHE-IW algorithm, the IW and weight coefficient (W) for each SNP in F are updated iteratively, and a new SNP having a maximum relevance with the phenotype is combined with SNP .
Time Complexity: In the FDHE-IW algorithm, the time complexity is defined as O(N + N*(k!)*K). Generally, the values of k and K are very small, and the value of (k!)*K is much less than N. Therefore, the time complexity of FDHE-IW is less than O(N2), which is a feasible computation amount for current computers to detect high-order SNP epistasis from a data set with thousands to millions of SNPs.
G-Test: A G-test is a maximum likelihood statistical significance test [32]. Compared to a Chi-square test, the G-test will lead to the same test results for samples of a rational size. However, for some cell cases, it is always better than the Chi-squared test [33].
In this study, an improved G-test method [19] was employed to verify the association between genotype and phenotype. For the k-way SNP combination model, the formula for calculating the G value is as follows:
The degree of freedom d (d = (I − 1)(J − 1)) is modified correspondingly, as follows:
where, and are the observed numbers and the expected number of genotypes, respectively, when the phenotype takes the state yj, and the genotype takes the ith k-combination. ln denotes the natural logarithm function. The observed number is obtained from the dataset by using a simple counting statistical method. The expected genotype frequency number (Eij) is obtained according to Hardy–Weinberg principles [34]. is a small integer that is less than or equal to 5.
2.2. Performance Evaluation and Simulation Data Sets
2.2.1. Performance Evaluation
To evaluate the performance of the proposed algorithm, equations for the power Equation (10), the F-measure Equation (11), the recall Equation (12), and the precision Equation (13) were utilized.
(9) |
(10) |
(11) |
(12) |
Power is a measure of the capability to detect disease-causing models in all datasets, where #S is the number of disease-causing models from all #T datasets (there are 100 data matrices for each disease model).
True positives (TPs) are defined as the discovery of a k-way SNP combination that is associated with disease status, and FNs (false negatives) are defined as a non-discovery of a SNP combination that is associated with disease. TNs (true negatives) indicate no discovery, and FPs (false positives) are defined as a k-way SNP combination that is falsely associated with a disease status [31].
In this experiment, recall, precision and F-measure were used to evaluate the statistical precision of this hypothesis testing method (G-test in our method) for finding disease models in the screening stage. #TP is equal to the number of disease-causing SNP combinations that have passed the threshold p-value, while #FN is the number of the disease-causing SNP combinations that failed to pass the threshold. #FP is the number of non-disease-causing combinations that passed the threshold, while #TN equals the number of non-disease-causing combinations that failed to pass the threshold.
2.2.2. Simulation Data Sets and Case Study
Twelve disease loci with marginal effects (DME)-simulated disease models (multiplicative models: DME-1–DME-4, threshold models: DME-5–DME-8, and concrete models: DME-9–DME-12) (see Tables S1 and S2 in Supplementary File 2) that are well characterized were utilized [17], with 100 simulated data sets being generated for each DME disease model using GAMETES_2.0 [35]. One of the generated data sets contained 100 SNPs, with 800 controls and 800 cases; while another contained 1000 SNPs, with 2000 cases and 2000 controls. The experimental results of our algorithm were then compared with BEAM [14], BOOST [7], MACOED [17], and SNPHarvester [12] results. BEAM is a classical heuristic search algorithm for detecting epistasis interactions. It applies a Bayesian partitioning model to disease-associated markers and their interactions, and uses a Markov chain Monte Carlo (MCMC) to compute the posterior probability that each marker set is associated with a given disease [14]. Boost detects the SNP epistasis interaction very rapidly using a filter and an exhaustive method. SNPHarvester is a classical and effective filtering-based approach for detecting epistatic interactions in genome-wide association studies. MACOED is a new intelligent search algorithm for detecting high-order epistatic interactions that reduces the number of SNP combinations that are examined in association with a phenotype.
To further validate the algorithm, real AMD data containing 103,611 genotypes SNPs for 50 controls and 96 cases [14] were utilized. To balance the case samples and controls, we enlarged the control sample size to 96 using a bootstrap method [36], which can significantly increase the statistical power, and imputed missing data using the k-nearest-neighbor method [37].
2.2.3. Paremeters Setting
In the DME simulation experiments, the threshold θ of the G-test p-value was set equal to , with minor allele frequency (MAF). The candidate size set to T = 2*k for simulation data sets and T = 200 for AMD. K = k for simulation data sets and K = 5 for AMD data. All experiments were performed using a Windows 10 operation system with Intel(R) Core(TM) i7-4790 CPU@3.6GHz and 8 GB memory, and all program codes were written in MATLAB R2015b (MathWorks, Natick, MA, USA). The source code is in Supplementary File FDHE-IW.rar.
3. Results
3.1. Simulated Models
The detection power of FDHE-IW was first investigated by comparing it with four state-of-the-art algorithms (BEAM, SNPHarvester, MACOED, and BOOST) using a DME dataset with 100 SNPs (Figure 2) and a data set with 1000 SNPs (Figure 3). The results in Figure 2 and Figure 3 show that FDHE-IW is superior to the other four algorithms when analyzing DME-1 and DME-3–DME-10, and is comparable with BOOST for DME-2 and DME-12.
The recall, precision and F-measure were also determined when using FDHE-IW for all 12 DME models (Table 1), with the five comparative algorithms also evaluated using a subset of DME models (Table 2 and Table 3). These results indicated that the performance of FDHE-IW when using the dataset with 1600 samples was inferior when compared to the dataset with 4000 samples in recall, precision, and F-measure.
Table 1.
Models | 100 SNPs (1600 Sample Size) | 1000 SNPs (4000 Sample Size) | ||||
---|---|---|---|---|---|---|
Recall | Precision | F-Measure | Recall | Precision | F-Measure | |
DME-1 | 26.09% | 54.55% | 35.29% | 67.74% | 84.00% | 75.00% |
DME-2 | 15.00% | 27.27% | 19.35% | 75.00% | 78.57% | 76.74% |
DME-3 | 44.83% | 44.83% | 44.83% | 79.25% | 55.26% | 65.12% |
DME-4 | 60.00% | 33.96% | 43.37% | 92.21% | 43.56% | 59.17% |
DME-5 | 69.62% | 91.67% | 79.14% | 100.00% | 85.22% | 92.02% |
DME-6 | 82.11% | 78.00% | 80.00% | 100.00% | 53.48% | 69.69% |
DME-7 | 96.00% | 64.43% | 77.11% | 100.00% | 35.46% | 52.36% |
DME-8 | 100.00% | 31.35% | 47.73% | 100.00% | 33.33% | 50.00% |
DME-9 | 95.00% | 73.08% | 82.61% | 99.00% | 73.88% | 84.62% |
DME-10 | 95.00% | 93.14% | 94.06% | 99.00% | 86.84% | 92.52% |
DME-11 | 98.41% | 100.00% | 99.20% | 100.00% | 100.00% | 100.00% |
DME-12 | 96.88% | 97.89% | 97.38% | 98.85% | 98.85% | 98.85% |
Table 2.
Models | Algorithms | Recall | Precision | F-Measure | Runtime(s) |
---|---|---|---|---|---|
Multiplicative model | FDHE-IW | 36.5% | 40.2% | 35.7% | 2.95 |
MACOED | 68.5% | 90.8% | 71.7% | 10.8 | |
BEAM | 14.8% | 10.3% | 7.7% | 8.52 | |
BOOST | 0.5% | 62.5% | 0.5% | 0.6 | |
SNPHarvester | 0.3% | 50.0% | 0.5% | 2.97 | |
Threshold model | FDHE-IW | 86.9% | 66.4% | 71.0% | 2.95 |
MACOED | 98.0% | 83.0% | 89.3% | 11.40 | |
BEAM | 84.5% | 68.3% | 60.5% | 8.52 | |
BOOST | 34.8% | 99.0% | 37.1% | 0.95 | |
SNPHarvester | 51.8% | 47.0% | 29.0% | 2.92 | |
Concrete model | FDHE-IW | 96.3% | 91.0% | 93.3% | 2.95 |
MACOED | 98.8% | 84.8% | 91.0% | 11.59 | |
BEAM | 81.3% | 62.3% | 70.1% | 8.52 | |
BOOST | 66.3% | 87.3% | 69.7% | 0.66 | |
SNPHarvester | 91.3% | 57.3% | 67.7% | 2.90 |
FDHE-IW: A fast approach for detecting high-order epistasis with interaction weight; MACOED: Multi-objective ant colony optimization epistasis detection; BEAM: Bayesian epistasis association mapping; BOOST: Boolean operation-based screening and testing.
Table 3.
Models | Algorithms | Recall | Precision | F-Measure | Runtime(s) |
---|---|---|---|---|---|
Multiplicative model | FDHE-IW | 78.55% | 65.35% | 69.01% | 65.4 |
MACOED | 75.00% | 32.85% | 20.82% | 440 | |
BEAM | 35.12% | 37.84% | 17.65% | 308 | |
BOOST | 18.50% | 22.95% | 9.40% | 4 | |
SNPHarvester | 5.55% | 32.20% | 3.61% | 130 | |
Threshold model | FDHE-IW | 100.00% | 51.87% | 66.02% | 66.6 |
MACOED | 100.00% | 14.31% | 11.08% | 450 | |
BEAM | 98.46% | 75.27% | 42.64% | 199 | |
BOOST | 88.50% | 62.50% | 33.70% | 9 | |
SNPHarvester | 87.50% | 29.45% | 12.24% | 143 | |
Concrete model | FDHE-IW | 99.21% | 89.89% | 94.00% | 35.3 |
MACOED | 100.00% | 38.60% | 23.89% | 303 | |
BEAM | 66.94% | 63.05% | 31.68% | 133 | |
BOOST | 72.50% | 20.83% | 15.82% | 6 | |
SNPHarvester | 84.00% | 92.45% | 43.80% | 62.3 |
The results in Table 2 and Table 3 showed that the performance of FDHE-IW was superior to BEAM, SNPHarvester, and BOOST in recall, precision, and F-measure. However, it was found to be inferior to MACOED for most of the DME models. This is because some disease-causing SNP combinations identified by FDHE-IW in the searching stage were then rejected in the testing stage, but in MACOED, only SNP combinations that were significantly associated with disease status were identified. For many disease-causing SNP combinations with a low MAF and low hereditary, MACOED fails to identify them in the screening stage. Therefore, the MACOED has higher precision than FDHE-IW in the testing stage, but the detection power of MACOED is much lower than that of FDHE-IW (Figure 2). As can be seen in Table 2 and Table 3, FDHE-IW has a lower runtime than MACOED, BEAM, and SNPHarvester, but it has a longer runtime than Boost.
3.2. Experimental Results Using an AMD Dataset
To further validate the FDHE-IW algorithm, a real AMD data set was evaluated. It took 5 h to find 200 2-way candidate solutions, 15 h to find 200 3-way candidate solutions, and 25 h to find 8 4-way candidate solutions. The algorithm identified 45 2-way SNP combinations (p-value < 10−11), 18 3-way SNP-combinations (p-value < 1 × 10−15; Table 4) and 2 4-way SNP-combinations (p-value = 0; Table 5) that were associated with AMD out of the 103,611 examined SNPs (Supplementary File 1, sheets 1-way–4-way). The obtained findings were further examined using Cytoscape (http://www.cytoscape.org/) [38], and a 2-way SNP interaction network and a 2-way gene interaction network were constructed (Figure 4), where the genes were mapped based on the SNP interaction networks.
Table 4.
SNP1 | Gene1 | SNP2 | Gene2 | SNP3 | Gene3 | G-test p-Value |
---|---|---|---|---|---|---|
rs380390 | CFH | rs1930022 | NA | rs3913094 | NA | 2.22 × 10−16 |
rs380390 | CFH | rs10504709 | NA | rs2402053 | NA | 1.11 × 10−16 |
rs380390 | CFH | rs10504709 | NA | rs2380684 | NA | 1.11 × 10−16 |
rs380390 | CFH | rs2380684 | NA | rs2224762 | JMJD2C | 0 |
rs380390 | CFH | rs10504548 | NA | rs2224762 | JMJD2C | 0 |
rs380390 | CFH | rs2402053 | NA | rs2224762 | JMJD2C | 0 |
rs380390 | CFH | rs718263 | NCALD | rs2224762 | JMJD2C | 0 |
rs1329428 | CFH | rs3775652 | INPP4B | rs6598991 | NA | 2.22 × 10−16 |
rs725518 | RRM1 | rs3775652 | INPP4B | rs1002979 | NA | 2.22 × 10−16 |
Table 5.
snp1 | Gen1 | SNP2 | Gen2 | SNP3 | Gen3 | SNP4 | Gen4 | G-test p-Value |
---|---|---|---|---|---|---|---|---|
rs1740752 | PCCA | rs4772270 | NA | rs7044653 | NA | rs6598991 | NA | 0 |
rs4772270 | NA | rs7044653 | NA | rs1329428 | CFH | rs6598991 | NA | 0 |
Within the SNP network (Figure 4), two widely reported SNPs (rs380390, rs1329428) that are located in an intron within the CFH gene were also identified. The CFH gene has been commonly association with AMD [14,18,19]. Furthermore, the constructed gene network showed that many of the SNPs are mapped to non-gene coding regions, thus denoted NA. Within the interaction network, NA and CFH had six connections due to six SNP pairs within the SNP network being mapped to a CFH–NA gene-pair. CFH was also found to have a novel connection with the JMJD2C gene, a histone lysine demethylase. JMJD2C has been reported to play a crucial role in the progression of breast cancer, prostate carcinomas, osteosarcoma, and blood neoplasms, thus indicating that JMJD2C represents a promising anti-cancer target [39,40,41]. These findings also suggest that JMJD2C may have an important role in AMD.
Eight 3-way gene combinations containing the CFH gene, and four 3-way gene combinations containing the JMJD2C gene were identified. Only one 3-way combination did not involve the CFH and JMJD2C genes, suggesting that these are important to AMD.
The two 4-way SNP combinations in Table 5 shows little uncertainty on whether every SNP contributes to the phenotype. Therefore, the AUCs (areas under the curve) of each SNP involved in a potential SNP combination (1-way, 2-way, and 3-way) were computed and then compared to the AUC relative to the 4-way SNP combinations (see Figures S1 and S2 in Supplementary File 3). The AUC of the 4-way SNP combination was larger than those of other sub SNP combinations. Hence, each SNP in the SNP combinations contribute to the development of AMD.
Recent investigations have utilized AMD data sets, which include algorithm IOBLPSO [42], epiACO [43], BEAM [14], epi forest [44], DCHE [45], FHSA-SED [18] and NHSA-DHSC [19]. Table 6 summarizes the results of these seven studies. Two SNPs (rs380390, rs1329428) and the CFH gene have been reported by all the seven algorithms. However, other SNPs and SNP combinations that are associated with AMD were detected by different algorithms. The proposed FDHE-IW also detected novel SNPs and genes, where rs10511467 (in NA) and rs3776652 (in the JMJD2C gene) were also reported by FHSA-SED and NHSA-DHSC. SNPs rs6598991 and rs10507949, both in NA, have not been reported to date.
Table 6.
FDHE-IW | BEAM | epi Forest | DCHE | FHSA-SED | NHSA-DHSC | epiACO | |
---|---|---|---|---|---|---|---|
Relevant SNPs or genes identified |
SNPs: rs380390 rs1329428 rs10511467 rs6598991 rs10507949 rs3776652 Genes: CFH JMJD2C INPP4B |
SNPs: rs380390 rs1329428 Gene: CFH |
SNPs: rs380390 rs1329428 rs1394608 rs7104698 Gene: CFH |
SNPs: rs380390 rs1329428 rs1394608 rs1740752 rs1363688 rs10512174 rs618499 rs1926489 Genes: CFH ZNF25 SGCD LRIG3 DRD1 ISCA1 |
SNPs: rs380390 rs1329428 rs10272438 rs1740752 rs3775652 rs1394608 rs1363688 rs10511467 Genes: CFH BBS9 SGCDINPP4B |
SNPs: rs380390 rs1329428 rs10272438 rs1363688 rs1394608 rs3775652 rs7104698 rs10511467 rs10512413 Genes: CFH INPP4B BBS9 ABL1ANKS1B |
SNPs: rs380390 rs1329428 rs1363688 rs1394608 rs2224762 rs9328536 rs943008 rs718263 Genes: CFH MED27 KDM4C NCALD NEDD9 |
4. Discussion
In this study, a fast-search method based on the interaction weight was proposed to detect k-way SNP epistasis that is associated with disease. When utilizing simulation data sets, the method developed herein was shown to be more powerful than other comparable algorithms in detecting the disease-causing SNP combinations at the searching stage. However, in the testing stage, the balance between type I and type II errors is very difficult to manage, due to the p-value threshold for distinguishing between true and false disease-causing combinations being very different for different disease models. Therefore, multiple statistical tests, such as the Chi-square test, t-test, and G-test, were evaluated, with the G-test being shown to be the most robust. When utilizing the newly developed algorithm to evaluate an AMD dataset, almost all of the well-known disease-causing SNP loci associated with AMD and some new SNP combinations were identified. However, the method developed herein identified many SNPs that are in non-coding genomic regions, which will require further examination in future studies. The detection of disease-associated SNPs in high-order disease models is a very difficult problem.
5. Conclusions
5.1. Advantage
The FDHE-IW algorithm does not need to evaluate all k-way SNP combinations to detect k-way disease-causing SNP combinations. It is capable of detecting some higher-order epistasis on a whole genome scale with a time complexity that is much less than O(N2).
5.2. Limitations
The proposed algorithm can effectively detection nested epistasis [35], suggesting that one or more of the interacting loci are the major contributors to disease, and that at least one proper subset of the loci also interacts epistatically. However, for some disease-causing SNP interaction combinations with pure, strict epistasis, the detection power of FDHE-IW is still unsatisfactory.
5.3. Future Work
At present there is still no fast or effective approach for detecting various disease-causing models with multi-loci in GWAS, due to the enormous computational burden. Therefore, detecting high-order disease models has room to be explored using high-performance methods and cloud computing. In future research, we will also focus on non-coding genomic regions, we will continue to focus on rare variants in GWASs, and we will develop new methods to aid in the identification of the causes of complex diseases. With the rapid development of high-performance cloud computing techniques, these abilities should continue to improve.
Acknowledgments
We would like to thank LetPub (www.letpub.com) for providing linguistic assistance during the preparation of this manuscript.
Supplementary Materials
The following are available online at http://www.mdpi.com/2073-4425/9/9/435/s1, Supplementary File 1.xls: All the SNP combinations and gene combinations; Supplementary File 2.pdf: Table S1: Penetrance functions of the three DME epistasis models, Table S2: the parameters and the penetrance values of 12 DME models; Supplementary File 3.pdf: Figure S1: AUC curves of 4-way SNP combination (SNP1: rs1740752 SNP2: rs4772270 SNP3: rs7044653 SNP4: rs6598991); Figure S2: AUC curves of 4-way SNP combination (SNP1: rs4772270, SNP2: rs7044653, SNP3: rs1329428, SNP4: rs6598991); Supplementary Material 4: FDHE-IW.rar: The matlab source code of FDHE-IW algorithm.
Author Contributions
Conceptualization, S.T.; Methodology, S.T.; Software, S.T.
Funding
This work was supported by the Natural Science Foundation of China under grant 61571341.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- 1.Manolio T.A. Genomewide association studies and assessment of the risk of disease. N. Engl. J. Med. 2010;363:166–176. doi: 10.1056/NEJMra0905980. [DOI] [PubMed] [Google Scholar]
- 2.Klein R.J., Zeiss C., Chew E.Y., Tsai J.Y., Sackler R.S., Haynes C., Henning A.K., SanGiovanni J.P., Mane S.M., Mayne S.T., et al. Complement factor H polymorphism in age-related macular degeneration. Science. 2005;308:385–389. doi: 10.1126/science.1109557. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Upton A., Trelles O., Cornejo-García J.A., Perkins J.R. Review: High-performance computing to detect epistasis in genome scale data sets. Brief. Bioinform. 2016;17:368–379. doi: 10.1093/bib/bbv058. [DOI] [PubMed] [Google Scholar]
- 4.Jiang R. Gene-gene interaction. In: Gellman M.D., Turner J.R., editors. Encyclopedia of Behavioral Medicine. Springer; New York, NY, USA: 2013. pp. 841–842. [Google Scholar]
- 5.Stanfill A.G., Starlarddavenport A. Primer in Genetics and Genomics, Article 7-Multifactorial Concepts: Gene-Gene Interactions. Biol. Res. Nurs. 2018;20:359–364. doi: 10.1177/1099800418761098. [DOI] [PubMed] [Google Scholar]
- 6.Moore J.H., Asselbergs F.W., Williams S.M. Bioinformatics challenges for genome-wide association studies. Bioinformatics. 2010;26:445–455. doi: 10.1093/bioinformatics/btp713. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Wan X., Yang C., Yang Q., Xue H., Fan X., Tang N.L., Yu W. BOOST: A fast approach to detecting gene–gene interactions in genome-wide case–control studies. Am. J. Hum. Genet. 2010;87:325–340. doi: 10.1016/j.ajhg.2010.07.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Ling S.Y., Yang C., Xiang W., Yu W. Gboost: A gpu-based tool for detecting gene–gene interactions in genome–wide case control studies. Bioinformatics. 2011;27:1309. doi: 10.1093/bioinformatics/btr114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Yang G., Jiang W., Yang Q., Yu W. PBOOST: A GPU based tool for parallel permutation tests in genome-wide association studies. Bioinformatics. 2015;31:1460–1462. doi: 10.1093/bioinformatics/btu840. [DOI] [PubMed] [Google Scholar]
- 10.Li X. A fast and exhaustive method for heterogeneity and epistasis analysis based on multi-objective optimization. Bioinformatics. 2017;33:2829–2836. doi: 10.1093/bioinformatics/btx339. [DOI] [PubMed] [Google Scholar]
- 11.Hahn L.W., Ritchie M.D., Moore J.H. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003;19:376–382. doi: 10.1093/bioinformatics/btf869. [DOI] [PubMed] [Google Scholar]
- 12.Yang C., He Z., Wan X., Yang Q., Xue H., Yu W. SNPHarvester: A filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics. 2009;25:504–511. doi: 10.1093/bioinformatics/btn652. [DOI] [PubMed] [Google Scholar]
- 13.Crawford L., Ping Z., Mukherjee S. Detecting epistasis with the marginal epistasis test in genetic mapping studies of quantitative traits. PLoS Genet. 2017;13:e1006869. doi: 10.1371/journal.pgen.1006869. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Zhang Y., Liu J.S. Bayesian inference of epistatic interactions in case–control studies. Nat. Genet. 2007;39:1167–1173. doi: 10.1038/ng2110. [DOI] [PubMed] [Google Scholar]
- 15.Wang Y., Liu X., Robbins K., Rekaya R. AntEpiSeeker: Detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm. BMC Res. Notes. 2010;3:117. doi: 10.1186/1756-0500-3-117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Aflakparast M., Salimi H., Gerami A., Dubé M.P., Visweswaran S., Masoudi-Nejad A. Cuckoo search epitasis: A new method for exploring significant genetic interactions. Heredity. 2014;112:666–674. doi: 10.1038/hdy.2014.4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Jing P.-J., Shen H.-B. MACOED: A multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies. Bioinformatics. 2015;31:634–641. doi: 10.1093/bioinformatics/btu702. [DOI] [PubMed] [Google Scholar]
- 18.Tuo S., Zhang J., Yuan X., Zhang Y., Liu Z. FHSA-SED: Two-locus model detection for genome-wide association study with harmony search algorithm. PLoS ONE. 2016;11 doi: 10.1371/journal.pone.0150669. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Tuo S., Zhang J., Yuan X., He Z., Liu Y., Liu Z. Niche harmony search algorithm for detecting complex disease associated high-order SNP combinations. Sci. Rep. 2017;7:11529. doi: 10.1038/s41598-017-11064-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Shang J., Sun Y., Liu J.X., Xia J., Zhang J., Zheng C.H. CINOEDV: A co-information based method for detecting and visualizing n -order epistatic interactions. BMC Bioinform. 2016;17:214. doi: 10.1186/s12859-016-1076-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Sinoquet C., Niel C. Enhancement of a stochastic Markov blanket framework with ant colony optimization, to uncover epistasis in genetic association studies. Bioinformatics. 2018;15:673–6780. doi: 10.1093/bioinformatics/bty154. [DOI] [PubMed] [Google Scholar]
- 22.Liu J., Yu G., Jiang Y., Wang J. HiSeeker: Detecting high-order SNP interactions based on pair-wise SNP combinations. Genes. 2017;8:153. doi: 10.3390/genes8060153. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Tuba M. Plenary lecture 3: Swarm Intelligence Algorithms Parameter Tuning; Proceedings of the WSEAS International Conference on Computer Engineering and Applications, and Proceedings of the 2012 American Conference on Applied Mathematics; Cambridge, UK. 25–27 January 2012. [Google Scholar]
- 24.Menezes B.A.M., Wrede F., Kuchen H., de Lima Neto F.B. Parameter Selection for Swarm Intelligence Algorithms—Case Study on Parallel Implementation of FSS; Proceedings of the 2017 IEEE Latin American Conference on Computational Intelligence; Arequipa, Peru. 8–10 November 2017. [Google Scholar]
- 25.Vinh N.X., Zhou S., Chan J., Bailey J. Can high-order dependencies improve mutual information based feature selection? Pattern Recognit. 2016;53:46–58. doi: 10.1016/j.patcog.2015.11.007. [DOI] [Google Scholar]
- 26.Shishkin A., Bezzubtseva A., Drutsa A., Shishkov I., Gladkikh E., Gusev G., Serdyukov P. Efficient High-Order Interaction-Aware Feature Selection Based on Conditional Mutual Information; Proceedings of the 30th Conference on Neural Information Processing Systems; Barcelona, Spain. 5–10 December 2016. [Google Scholar]
- 27.Song Q., Ni J., Wang G. a fast clustering-based feature subset selection algorithm for high-dimensional data. IEEE Trans. Knowl. Data Eng. 2012;25:1–14. doi: 10.1109/TKDE.2011.181. [DOI] [Google Scholar]
- 28.Claude E. A mathematical theory of communication. Bell Syst. Tech. J. 1948;27:379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x. [DOI] [Google Scholar]
- 29.Jakulin A., Bratko I. Testing the Significance of Attribute Interactions; Proceedings of the Twenty-First International Conference on Machine Learning; Banff, AB, Canada. 4–8 July 2004; pp. 409–416. [Google Scholar]
- 30.Zeng Z., Zhang H., Zhang R., Yin C. A novel feature selection method considering feature interaction. Pattern Recognit. 2015;48:2656–2666. doi: 10.1016/j.patcog.2015.02.025. [DOI] [Google Scholar]
- 31.Niel C., Sinoquet C., Dina C., Rocheleau G., Kelso J. SMMB—A stochastic Markov-blanket framework strategy for epistasis detection in GWAS. Bioinformatics. 2018;34:2773–2780. doi: 10.1093/bioinformatics/bty154. [DOI] [PubMed] [Google Scholar]
- 32.McDonald J.H. Handbook of Biological Statistics. 3rd ed. Sparky House Publishing; Baltimore, MD, USA: 2014. G–Test Goodness-of-Fit; pp. 53–58. [Google Scholar]
- 33.Harremoës P., Tusnády G. Information divergence is more chi squared distributed than the chi squared statistic. arXiv. 2012. 1202.1125
- 34.Crow J.H. Weinberg and language impediments. Genetics. 1999;152:821–825. doi: 10.1093/genetics/152.3.821. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Urbanowicz R.J., Kiralis J., Sinnott-Armstrong N.A., Heberling T., Fisher J.M., Moore J.H. GAMETES: A fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData Min. 2012;5:1–14. doi: 10.1186/1756-0381-5-16. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Vélez J.I., Chandrasekharappa S.C., Henao E., Martinez A.F., Harper U., Jones M., Solomon B.D., Lopez L., Garcia G., Aguirre-Acevedo D.C., et al. Pooling/bootstrap-based GWAS (pbGWAS) identifies new loci modifying the age of onset in PSEN1 p.Glu280Ala Alzheimer’s disease. Mol. Psychiatr. 2013;18:568–575. doi: 10.1038/mp.2012.81. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Tutz G., Ramzan S. Improved methods for the imputation of missing data by nearest neighbor methods. Comput. Stat. Data Anal. 2015;90:84–99. doi: 10.1016/j.csda.2015.04.009. [DOI] [Google Scholar]
- 38.Shannon P., Markiel A., Ozier O., Baliga N.S., Wang J.T., Ramage D., Amin N., Schwikowski B., Ideker T. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498–2504. doi: 10.1101/gr.1239303. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Zhang C., Wang Z., Ji Q., Li Q. Histone demethylase JMJD2C: Epigenetic regulators in tumors. Oncotarget. 2017;8:91723–91733. doi: 10.18632/oncotarget.19176. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Hong Q., Yu S., Yang Y., Liu G., Shao Z. A polymorphism in JMJD2C alters the cleavage by caspase-3 and the prognosis of human breast cancer. Oncotarget. 2014;5:4779–4787. doi: 10.18632/oncotarget.2029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Burton A., Azevedo C., Andreassi C., Riccio A., Saiardi A. Inositol pyrophosphates regulate JMJD2C-dependent histone demethylation. Proc. Natl. Acad. Sci. USA. 2013;110:18970–18975. doi: 10.1073/pnas.1309699110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Shang J., Sun Y., Li S., Liu J., Zheng C., Zhang J. An improved opposition-based learning particle swarm optimization for the detection of SNP-SNP interactions. BioMed Res. Int. 2015;2015:524821. doi: 10.1155/2015/524821. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Sun Y., Shang J., Liu J., Li S., Zheng C. epiACO—A method for identifying epistasis based on ant Colony optimization algorithm. BioData Min. 2017;10:23. doi: 10.1186/s13040-017-0143-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Jiang R., Tang W., Wu X., Fu W. A random forest approach to the detection of epistatic interactions in case-control studies. BMC Bioinform. 2009;10:S65. doi: 10.1186/1471-2105-10-S1-S65. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Guo X., Meng Y., Yu N., Pan Y. Cloud computing for detecting high-order genome-wide epistatic interaction via dynamic clustering. BMC Bioinform. 2014;15:102. doi: 10.1186/1471-2105-15-102. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.