A hidden two-locus disease association pattern in genome-wide association studies

Can Yang; Xiang Wan; Qiang Yang; Hong Xue; Nelson LS Tang; Weichuan Yu

doi:10.1186/1471-2105-12-156

. 2011 May 14;12:156. doi: 10.1186/1471-2105-12-156

A hidden two-locus disease association pattern in genome-wide association studies

Can Yang ^1,^✉, Xiang Wan ^1,^✉, Qiang Yang ², Hong Xue ³, Nelson LS Tang ⁴, Weichuan Yu ^1,^✉

PMCID: PMC3116488 PMID: 21569557

Abstract

Background

Recent association analyses in genome-wide association studies (GWAS) mainly focus on single-locus association tests (marginal tests) and two-locus interaction detections. These analysis methods have provided strong evidence of associations between genetics variances and complex diseases. However, there exists a type of association pattern, which often occurs within local regions in the genome and is unlikely to be detected by either marginal tests or interaction tests. This association pattern involves a group of correlated single-nucleotide polymorphisms (SNPs). The correlation among SNPs can lead to weak marginal effects and the interaction does not play a role in this association pattern. This phenomenon is due to the existence of unfaithfulness: the marginal effects of correlated SNPs do not express their significant joint effects faithfully due to the correlation cancelation.

Results

In this paper, we develop a computational method to detect this association pattern masked by unfaithfulness. We have applied our method to analyze seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). The analysis for each data set takes about one week to finish the examination of all pairs of SNPs. Based on the empirical result of these real data, we show that this type of association masked by unfaithfulness widely exists in GWAS.

Conclusions

These newly identified associations enrich the discoveries of GWAS, which may provide new insights both in the analysis of tagSNPs and in the experiment design of GWAS. Since these associations may be easily missed by existing analysis tools, we can only connect some of them to publicly available findings from other association studies. As independent data set is limited at this moment, we also have difficulties to replicate these findings. More biological implications need further investigation.

Availability

The software is freely available at http://bioinformatics.ust.hk/hidden_pattern_finder.zip.

Background

The development of DNA microchip technology has allowed the analysis of single nucleotide polymorphism (SNPs) on a genome-wide scale to identify genetic variants associated with diseases. Researchers have proposed many methods to investigate association patterns of complex diseases. Two recent reviews [1,2] presented detailed analyses on many popular methods and tools, such as multifactor dimensionality reduction (MDR) [3], Random Jungle [4], Bayesian epistasis association mapping (BEAM) [5] and PLINK [6]. MDR is a popular non-parametric approach for detecting all possible k-way combinations of SNPs that interact to influence disease traits. Random Jungle (i.e., Random Forest [7]), is to solve classification and regression problems. In random forest, decision trees are combined to produce accurate predication. Its ability to handle the high dimensional problems in GWAS has been shown in [8,9]. BEAM designs a Bayesian marker partition model which classifies SNP markers into three types: SNPs unassociated with the disease, SNPs contributing to the disease susceptibility independently, and SNPs influencing the disease risk jointly with each other. In this model, a first order Markov chain is designed for the accommodation of correlation between adjacent SNPs. Markov Chain Monte Carlo (MCMC) sampling is used to optimize the posterior probability of the model. In addition, the "B-statistic" designed in BEAM can be used in the frequentist hypothesis-testing framework. PLINK provides a toolkit for flexible analyses, in which various statistical tests for single-locus analysis, haplotype analysis and allelic-based interaction analysis are implemented. Recently, a new method named "BOOST" [10] allows examination of all pairwise interactions in genome-wide case-control studies. As a result, many genetic susceptibility determinants have been mapped.

However, there is another type of association pattern, which often occurs within local regions in the genome and may not be detected by these methods. This association pattern involves multiple correlated SNPs with neither strong marginal effects nor strong interaction effects. But they can jointly display strong associations. Here we use some simple regression models to explain this association pattern. Suppose we have two dependent variables, X₁and X₂, and one independent variable Y. We can fit two regression models (or logistic regressions for case-control data), Inline graphic and , to test the association significance of these two variables. Here and are named as marginal coefficients. If these two marginal coefficients are very small, single variable analysis methods will consider them statistically insignificant and ignore them.

However, if X₁correlates with X₂, fitting the model Y ~ β₁X₁+ β₂X₂may identify a new association pattern with β₁and β₂(named as bivariate regression coefficients) being significantly larger than Inline graphic and . This phenomenon is referred to as unfaithfulness. It means that the marginal effects of correlated variables do not express their significant joint effects faithfully due to the correlation cancelation [11]. Figure 1 provides some synthetic examples to show the unfaithfulness involving two variables. There are four scenarios to illustrate the relationship between marginal coefficients (marked using red color) and bivariate regression coefficients. The first scenario (Figure 1:(a)) is a reference case that involves no correlations between X₁and X₂. The marginal coefficients Inline graphic and are equal to the bivariate regression coefficients β₁and β₂, respectively. In the second scenario (Figure 1:(b)), X₁is positively correlated with X₂. The marginal coefficients are bigger than the bivariate regression coefficients. In the third scenario (Figure 1:(c)), X₁is negatively correlated with X₂. The marginal coefficient Inline graphic and could be significantly smaller than the bivariate regression coefficients β₁and. β₂In the the fourth scenario (Figure 1:(d)), X₁is positively correlated with X₂. But the sign of β₁is the opposite of the sign of β₂. The correlation effect in the third scenario and the fourth scenario causes the unfaithfulness. In mathematics, the relationship between the marginal coefficients and the bivariate regression coefficients is formulated as

(1)

**Illustration of unfaithfulness in association studies**. There are three regression models in each scenario: Y ~ β₁X₁+ β₂X₂, and . In this figure, the marginal coefficient and are shown as projections (marked with bold red color) of Y on X₁and X₂, respectively. (a) X₁is not correlated with X₂. (b) X₁is positively correlated with X₂. (c) X₁is negatively correlated with X₂. (d) X₁is positively correlated with X₂but the sign of β₁is the opposite of the sign of β₂. Scenario (c) and Scenario (d) illustrate unfaithfulness.

where Inline graphic is the expectation of the marginal coefficient , ρ(X₁, X₂) is the population correlation between X₁and X₂. The marginal coefficients depend on their bivariate regression coefficients as well as the variable correlation, as we illustrated in Figure 1. We will give more explanation on this relationship in the high dimensional setting in the discussion section.

From the statistical point of view, different correlation patterns could cause the marginal coefficients Inline graphic and significantly different from the bivariate regression coeffcients β₁and β₂(shown in Figure 1). In fact, the issue of unfaithfulness has been discussed in the causality literature [12]. In GWAS, the correlation among SNPs arises due to the linkage disequilibrium pattern of the genome. A natural question arises: Does the issue of unfaithfulness occur in GWAS?

To answer this question, a computational method for detecting associations masked by unfaithfulness is necessary. In this paper, we propose a simple method to detect such associations involving two SNPs. It can evaluate each SNP pair in genome-wide case-control studies in a fast manner. We have applied our method to analyze seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). The experimental results show that these associations widely exist in GWAS. In this work, we only handle the unfaithfulness issue involving two SNPs while the unfaithfulness can exist among a large number of markers. The detection of associations involving three or more SNPs is too time-consuming and beyond the scope of this work.

Results

Experiment on simulation study

The simulation study is designed to compare our proposed method with other three methods for detecting associations in the presence of unfaithfulness. These three methods include the marginal association test (single-locus analysis), Lasso [11,13] and BEAM [5]. The reasons that we choose these methods for comparison are as follows:

• Marginal association test is used in almost every GWAS due to its simplicity and effectiveness.

• Lasso is a shrinkage and selection method for (generalized) linear regression. It imposes a sparsity constraint (i.e., only a small fraction of variables are relevant) and uses L₁penalty to eliminate irrelevant variables. Fast algorithms are available for Lasso. Thus, it can simultaneously analyze a huge number of variables. It is very popular in genetics [11,14-16].

• BEAM has the capability of detecting both marginal associations and interactions in large-scale data sets. It uses first order Markov chain to accommodate the correlation between adjacent SNPs.

The details about the parameter settings in simulation are provided in the method section. In our simulation study, we only handle the unfaithfulness involving two associated variables X₁and X₂by using β₁> 0, β₂< 0 and ρ(X₁, X₂) > 0 as illustrated in Figure 1(d). The marginal coefficients Inline graphic and will be small due to the cancelation given by Equation (1). When β₁> 0, β₂> 0 and ρ(X₁, X₂) < 0 the unfaithfulness also happens. This corresponds to a situation that the minor alleles of both X₁and X₂increase the diseases risk but X₁and X₂are negatively correlated, as illustrated in Figure 1(c).

The results in Figure 2 indicate that it is difficult for existing methods to detect the association masked by unfaithfulness while our proposed method achieves reasonable performance. Specifically, the poor performance of the marginal association test is not surprising since the marginal effects are weak in the presence of unfaithfulness. Although Lasso can simultaneously analyze all SNPs, it still suffers from the difficulty of detecting associations masked by unfaithfulness. This agrees with the analysis result in [11]. BEAM has a better performance, which should be attributed to its first order Markov chain designed for the accommodation of correlation. But its performance is still not comparable with the performance of our proposed method in most settings.

**The performance comparison of four methods: Marginal association tests, Lasso, BEAM and the proposed exhaustive two-locus joint analysis**. 100 data sets are generated under each parameter setting. 1000 samples (500 cases and 500 controls) are simulated in each data set. The power is calculated as the proportion of the 100 data sets in which the disease associated SNPs are detected.

Another interesting point is that the statistical power of existing methods decreases as the linkage disequilibrium (LD) r²increases. Although our proposed method also degrades its performance when LD increases, it maintains a relatively high power for strong LD (r²= 0.7).

Experiment on seven data sets from WTCCC

We have applied our method to analyze the data sets (14,000 cases in total and 3,000 shared controls) from the WTCCC [17]. WTCCC studies seven common human diseases, including bipolar disorder (BD), coronary artery disease (CAD), Crohn's disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D) and type 2 diabetes (T2D). These data sets are generated using the affymetrix 500 K chip. We first apply a similar quality control procedure as suggested in [17] to pre-process the data. The numbers of remaining SNPs for seven data sets are around 360,000. In current stage, BEAM cannot directly handle these data sets [2].

Table 1 lists the numbers of identified two-locus associations masked by unfaithfulness under three statistical significance thresholds with and without the distance threshold for seven data sets. It shows that the unfaithfulness widely exists in these data sets. Some associations masked by unfaithfulness involve SNPs with at least 1 M base pair distance. However, all of them are located in the major histocompatibility complex (MHC) region (The MHC region encodes a large number of genes. It has extensive polymorphism and linkage disequilibrium with the long distance [18]). Therefore, the results in Table 1 provide the evidence that this association pattern typically occurs in local area. These results also suggest that using the local search can speed up the whole process in the future.

Table 1.

The number of two-locus unfaithfulness associations identified from seven diseases data sets under different constraints.

	BD	CAD	CD	HT	RA	T1D	T2D

T¹	48	31	25	46	132	153	67

T²	52	35	28	51	153	204	80

T³	60	36	29	52	165	252	84

T¹& Dist	0	0	0	1	1	1	0

T²& Dist	0	0	0	3	17	17	0

T ³& Dist	0	0	0	0	4	35	0

Open in a new tab

T¹- the threshold of Bonferroni-corrected P-value is 0.10; T²- the threshold of Bonferroni-corrected P-value is 0.20; T³- the threshold of Bonferroni-corrected P-value is 0.30; Dist - the physical distance threshold between two SNPs is at least 1 Mb. This threshold is used to see how many unfaithfulness associations involving two remote loci

From the identified associations, we further conduct the gene mapping and identify some suspicious genes closely related with the disease traits. Table 2 and Table 3 report the unadjusted single-locus P-values, the unadjusted joint P-values, the marginal coefficients and joint bivariate coefficients for these associations. The other details are listed in the supplementary document (Additional file 1). These identified associations coincide with Figure 1(d). To date, we can only connect some identified associations to publicly available results from other association studies. Many identified association patterns still remain unexplained. In the following, we explain the details of some associations that are confirmed by other studies.

Table 2.

Some associations masked by unfaithfulness from the WTCCC data set.

Disease	SNP X_p	Single-locus P -value	SNP X_q	Single-locus P -value	Chr	Gene	Unfaithfulness P -value
BD	rs668860	0.053	rs10873672	0.245	1	MCOLN2	4.885 × 10^-15
	rs668860	0.053	rs6691970	0.216	1	MCOLN2	6.217 × 10^-15

CAD	rs7162070	0.867	rs16969478	0.160	15	FSIP1	5.551 × 10^-15
	rs1876853	0.903	rs16969478	0.160	15	FSIP1	2.310 × 10^-13
	rs8029602	0.853	rs16969478	0.160	15	FSIP1	5.274 × 10^-14
	rs16969475	0.823	rs16969478	0.160	15	FSIP1	1.259 × 10^-13

T1D	rs1058318	0.074	rs2252745	0.840	6	GNL1, PPP1R10	1.326 × 10^-12

HT	rs2300390	0.460	rs12482676	0.061	21	RCAN1	2.442 × 10^-15

Open in a new tab

Table 3.

Regression coefficients of those associations listed in Table 2.

Disease	SNP X_p	(z Value)	SNP X_q	(z Value)	r²	β₁(z Value)	β₂(z Value)
BD	rs668860	0.0162 (0.392)	rs10873672	0.0679 (1.662)	0.961	-1.379 (-5.823)	1.402 (5.989)
	rs668860	0.0162 (0.392)	rs6691970	0.0706 (1.728)	0.958	-1.397 (-5.934)	1.421 (6.107)

CAD	rs7162070	0.0301 (0.482)	rs16969478	-0.108 (-1.732)	0.913	2.621 (5.671)	-2.637 (-5.720)
	rs1876853	0.0208 (0.331)	rs16969478	-0.108 (-1.732)	0.913	2.354 (5.538)	-2.372 (-5.603)
	rs8029602	0.0175 (0.281)	rs16969478	-0.108 (-1.732)	0.914	2.214 (5.594)	-2.238 (-5.676)
	rs16969475	0.0184 (0.294)	rs16969478	-0.108 (-1.732)	0.913	2.110 (5.660)	-2.132 (-5.746)

T1D	rs1058318	0.0992 (2.269)	rs2252745	-0.0167 (-0.375)	0.887	1.0292 (7.530)	-1.005 (-7.219)

HT	rs2300390	-0.0600 (-1.182)	rs12482676	0.0531 (1.037)	0.903	-1.215 (-6.567)	1.224 (6.577)

Open in a new tab

Here we assume additive effects in regression.

Bipolar disorder (BD)

Among associated SNP pairs identified from the BD data set, we find two suspicious SNP pairs, (rs668860, rs10873672) and (rs668860, rs6691970). The unadjusted P-values for these two SNP pairs are 4.885 × 10^-15and 6.217 × 10^-15, respectively. They are still significant after Bonferroni correction. However, none of these three SNPs (rs668860 is involved in both pairs) was reported in [17] because their marginal effects are too weak to be detected by the single-locus association test. The unadjusted P-values for these three SNPs based on the single-locus association test are 0.053, 0.245 and 0.216, respectively. All three SNPs reside in the intron of gene MCOLN2. The protein Mucolipin-2, encoded by gene MCOLN2 and also known as TRPML2 (transient receptor potential cation channel, mucolipin subfamily, member 2), has been confirmed to have strong associations with bipolar disorder in a family-based association study [19]. To our knowledge, this is the first identified association between the MCOLN2 gene and the bipolar disorder risk in a population-based association study.

Figure 3 shows the joint distributions of the pair (rs668860, rs10873672) (The other pair shares a similar pattern.) in cases and controls and the corresponding odds ratios. The genotype combination "CT/TT" has a significantly higher odds ratio than other genotype combinations. Further investigations of the MCOLN2 gene may help identify the causes of bipolar disorder disease.

**Distributions of genotypes of rs668860 and rs10873672 in the bipolar disorder data set and the odds ratio computed for combined genotypes of these two SNPs**. Left Panel: The distribution of genotypes of rs668860 and rs10873672 in case samples. Middle Panel: The distribution of genotypes of rs668860 and rs10873672 in control samples. Right panel: The estimated odds ratio for the combination of rs668860 and rs10873672. The odds ratio of genotype combination "AA/TT" is used as reference. The genotype combination "TT/CT" has significantly higher odds ratio than other genotype combinations.

We further use BEAGLE [20] to impute the SNPs in this local area so that we can see the enriched signals after imputation. This region includes 300 SNPs. It begins with the SNP rs1030933 and ends with the SNP rs1837329. After imputation, we analyze the imputed data and the result is given in Figure 4. Figure 4(a) shows the enriched signal. The intensity shows -log₁₀P values given by the joint regression (P-value is calculated based on Inline graphic ). Figure 4(b) shows the LD structure of this local area. Figure 4(c) shows the -log₁₀P values obtained using single-SNP analysis (P-values are calculated based ). Figure 4(d) shows the locations of rs668860, rs10873672 and rs6691970. From Figure 4, we can see the following:

**Analysis result of the local region of the BD data set located by rs668860, rs10873672 and rs6691970**. (a) The enriched signal after imputation: The intensity shows -*log*₁₀P given by the joint regression. (b) The LD structure of this local area. (c) The -*log*₁₀P value obtained using single-SNP analysis. (d) The locations of rs668860, rs10873672 and rs6691970.

• Although this region is in strong LD (see Figure 4(b)), association masked by unfaithfulness does not happen across the entire area. This shows that this type of asssociation not only depends on the correlation structure but also depends on the effects of the SNPs, as we illustrated in Figure 1 (also see Equation 1).

• From Figure 4(c), the marginal effects of the imputed SNPs are very weak. This indicate that this type of association is not caused by some ungenotyped causative SNPs. Instead, it is a genuine effect.

Coronary artery disease (CAD)

We identify four suspicious associations involving five SNPs. The unadjusted P-values for these four associations range from 2.310 × 10^-13to 5.551 × 10^-15. The unadjusted single-locus P-values for five SNPs involved in these five associations indicate that they do not have noticeable marginal effects. All five SNPs reside in the intron of gene FSIP1 (fibrous sheath interacting protein 1). We have not found evidence to directly connect gene FSIP1 with the coronary artery disease. However, the LD analysis identifies a well studied gene THBS1 (thrombospondin-1), which is centromeric to gene FSIP1 and has been confirmed to increase the risk of coronary artery disease in many studies [21-23]. It would be of great interest to investigate gene FSIP1 in determining genetic susceptibility to coronary artery disease.

Here we also show the enriched signals obtained from the imputation. Figure 5(a) shows the -log₁₀P given by the joint regression. Figure 5(b) shows the LD structure (r²) in that region. Figure 5(c) shows the -log₁₀P of single SNP analysis. Figure 5(d) shows the locations of the genotyped SNPs which are listed in Table 2. Again, the marginal effects of the imputed SNPs are weak. We see clearly that the signal of unfaithfulness appears in the block-like manner.

**Analysis result of the local region of the CAD data set located by rs7162070, rs1876853, rs8029602, rs16969475 and rs16969478**. (a) The enriched signal after imputation: The -*log*₁₀P value given by the joint regression. (b) The LD structure (r²) in the same region. (c) The -*log*₁₀P of single SNP analysis. (d) The locations of the genotyped SNPs rs7162070, rs1876853, rs8029602, rs16969475 and rs16969478.

Type 1 diabetes (T1D)

Most identified associations from the T1D data set are linked with the MHC region. The MHC region at chromosomal position 6p21 encodes many genes (such as HLA-DQB1 and HLA-DRB1) that have been associated with type 1 diabetes [17,24] by using the single-locus test. However, it is still unclear which and how many loci within the MHC region determine T1D susceptibility because of the functional complexity of this small human genome segment. The MHC region has been connected with more than 100 diseases, such as diabetes, rheumatoid arthritis, psoriasis, asthma and various autoimmune disorders. Our results provide additional information to locate disease-associated loci. Concretely, one suspicious association involves SNP rs1058318 and SNP rs2252745. The unadjusted P-value of this association is 1.326 × 10^-12. The unadjusted single-locus P-values of rs1058318 and rs2252745 are 0.074 and 0.840, respectively. SNP rs1058318 resides in the intron region of gene GNL1 and SNP rs2252745 resides in the intron region of gene PPP1R10. Both genes are located in the MHC region and adjacent to each other. Gene GNL1 belongs to the HLA-E class. The locus in HLA-E has been strongly associated with type 1 diabetes [25]. The detailed examination of the relationship between gene GNL1 and gene PPP1R10 may provide some new insights in studying the causes of type 1 diabetes.

Hypertension (HT)

Among associations identified from the HT data set, we find one suspicious pair involving SNP rs2300390 and SNP rs12482676. The unadjusted P-value is 2.442 × 10^-15. The unadjusted single-locus P-values for rs2300390 and rs12482676 are 0.460 and 0.061, respectively. Both SNPs reside in the intron of gene RCAN1. Gene RCAN1 mainly functions as a regulator of calcineurin. Calcineurin participates in many cellular and tissue functions. Its abnormal expression is associated with many diseases including hypertension [26].

Crohn's disease (CD), rheumatoid arthritis (RA) and type 2 diabetes (T2D)

Currently, we have difficulties to connect the identified associations of CD, RA and T2D to publicly available findings from other association studies. Their biological implications need to be further explored.

Experiment on the Illumina data sets from other independent studies

We further analyze the Crohn's Disease data set [27], in which 308,332 autosomal SNPs were assayed on the Illumina HumanHap300 chip. After a standard quality control (the proportion of miss values ≤ 10%, the minor allele frequency ≥ 5% and the P-value of Hardy-Weinberg equilibrium ≥ 0.0001), the number of remaining SNPs is 291,964.

We apply our method to this data set and do not find any significant associations masked by unfaithfulness. Our explanation is that Illumina chip uses the tagSNP design and the correlation among SNPs is less than that of Affymetrix 500 K chip used by WTCCC. This result indicates that it is unlikely to detect associations masked by unfaithfulness using the tagSNP design.

In order to check if imputation helps in identifying significant association masked by unfaithfulness, we focus on the SNP regions in which we have identified associations from the WTCCC CD data set (Additional file 1: Table S3), impute the corresponding SNP data from [27], and re-run our analysis. Unfortunately, we fail to replicate those findings in the WTCCC CD data set. We have examined the imputation result carefully. At those local regions we are interested in, few SNPs are directly genotyped. In the hapmap data, hundreds of SNPs locate in those areas. This implies hundreds of SNPs need to been imputed using the information coming from the reference panel. In fact, the frequencies of those imputed haplotypes are almost the same in cases and controls. This is probably the reason that we cannot replicate those findings. Hopefully, next-generation sequencing will provide high resolution SNP data to resolve this issue. Another important reason may be that these two CD data sets are from different populations (one comes from Europe, another comes from north America).

Similarly, we have analyzed another RA data set [28] from Genetic Analysis Workshop 16. This data set is acquired from North American population. The SNPs are genotyped by the Illumina chip. We also have difficulties to replicate the findings of the RA data set from WTCCC. We hope we can get access to more data sets to verify our results in the future.

Discussion

The unfaithfulness issue in the high dimensional feature space

In the high dimensional feature space, many features could correlate with each other by chance, which leads to the existence of unfaithfulness and poses a great challenge on feature selection and association analysis. In this work, we only handle the unfaithfulness issue involving two variables (SNPs), while the unfaithfulness can exist among a huge number of variables. The relationship between the marginal coefficient ( Inline graphic in ) and the regression coefficient ( in β₁X₁+···+β_pX_p+···+β_sX_s) is given as follows [11]:

(2)

where Inline graphic is the expectation of marginal coefficient, ρ(X_q, X_p) is the population correlation between X_qand X_p. If , then can be close to zero no matter how large β_pis. In addition, the number of involved variables could be very big. To exclude the effect of unfaithfulness in feature selection, Fan and Lv [29] had to make an assumption that there is a C > 0 such that Inline graphic for p = 1,..., s, and then proved that the truly associated variables will be among those having the highest marginal coefficients.

In our simulation study, we only handle the unfaithfulness involving two associated variables X₁and X₂by using β₁> 0, β₂< 0 and ρ(X₁, X₂) > 0 as illustrated in Figure 1(d). The marginal coefficients Inline graphic and will be small due to the cancelation given by Equation (2). When β₁> 0, β₂> 0 and ρ(X₁, X₂) < 0, the unfaithfulness also happens. This corresponds to a situation that the minor alleles of both X₁and X₂increase the diseases risk but X₁and X₂are negatively correlated, as illustrated in Figure 1(c). The simulation result shows that the marginal test and Lasso perform poorly. The better performance of BEAM should be attributed to its first order Markov chain designed for the accommodation of correlation. Although our exhaustive method performs reasonably well, the direct extension of our method to deal with three or more loci is computationally expensive. Therefore, solving the unfaithfulness issue is still challenging.

The Relationship between interaction models and unfaithfulness

In this work, we only deal with a two-locus association pattern involving the unfaithfulness. There are many works [3,5,6,30] discussing two-locus associations. Most of them belong to the category of interaction analysis (see details in [1,2]). The SNP interaction is also referred to as "epistasis". The most common statistical definition of interactions is the statistical deviation from the additive effects of two loci on the phenotype [2]. Using the same example we discussed in the introduction section, testing interactions between X₁and X₂is to first fit the regression model (or logistic regressions for case-control data) Y ~ β₁X₁+ β₂X₂+ β₁₂X₁X₂and then test the significance of β₁₂. There is no direct connection between β₁(or β₂) and β₁₂In the analysis of unfaithfulness, the relationship between marginal coefficients ( Inline graphic , ) and joint coefficients (β₁, β₂) is given in Equation (2). The interaction term plays no role here. Therefore, it is not appropriate to use interaction models to detect associations masked by unfaithfulness.

The Relationship between unfaithfulness and confounding

Suppose we are studying the relationship between two variables X and Y using model Inline graphic . Confounding arises when there is another observed variable Z which is independently associated with X and Y. Specifically, we have for model and fot model . When studying the relationship between X and Y, it is necessary to account for the confounding effect by using model Y ~ β_yzZ + β_yxX. In other words, confounding is more like the situation illustrated in Figure 1 (b). Readers are referred to [31] for the detailed explanation of confounding.

The unfaithfulness is different. For model Inline graphic and , both and are close to zero. For joint model Y ~ β₁X₁+ β₂X₂, both β₁and β₂are not zero, as illustrated in Figure 1(c) and 1(d).

Biological interpretations

There are two possible biological interpretations. The first interpretation is illustrated in Figure 1(d). Consider two loci X_pand X_qwhich are positively correlated. When X_qincreases the disease risk (β_q> 0) and X_pacts as a protective locus (β_p< 0), unfaithfulness happens. The identified associations and their coefficients listed in Table 3 indicate that these associations indeed exist.

The second interpretation is illustrated in Figure 1(c). Consider two loci X_pand X_qwhich are negatively correlated. When both X_pand X_qincrease the disease risk (β_p> 0 and β_q> 0), unfaithfulness also happens. This case may be particularly interesting when analyzing SNPs with low allele frequencies [32]. Suppose the allele frequencies of both X_pand X_qare low and thus the mutations happening at these two loci are relatively recent. We can further assume the haplotype a - a does not exist (because the probability of both two mutations happen in a short period is very small). This implies these two loci are negatively correlated. Unfortunately, we do not identify this type of associations. Possible reasons include: (1) The current genotyping chip is designed based on the "common disease/common variant" model [33,34], the low frequency SNPs are not directly assayed. (2) The statistical power of current testing strategy is relatively low to handle rare variants.

The unfaithfulness implications on tagSNPs

GWAS is considered as a powerful approach to detecting genetic susceptibility of common diseases. Such studies require the genotypes of a huge number of SNPs across the genome, each of which is tested for association with the phenotype of interest. This is generally referred to as the direct test of association, in which the functional mutation is presumed to be genotyped. An alternative approach is to take advantage of the correlation between SNPs. This approach genotypes a subset of SNPs, called tagSNPs, which are in high linkage disequilibrium with other SNPs [33]. The association tests of tagSNPs are used to indirectly infer the association of other correlated SNPs. This approach is widely used to save genotyping costs in GWAS. Many tagging methods [33,35,36] have been developed to reduce the number of markers to be genotyped. One key assumption in these methods is that the association analysis of a set of highly correlated SNPs is equivalent with the association analysis of tagSNPs of this set. However, the existence of unfaithfulness poses a challenge for these methods. The weak marginal association of a tagSNP does not imply the weak association of the corresponding genome region in which this tagSNP is located. The reason is that some non-genotyped SNPs correlating with the tagSNPs may jointly display strong associations in the presence of unfaithfulness.

In this work, we analyzed the WTCCC data generated by the Affymetrix 500 K chip and a Crohn's disease data set generated by the Illumina chip. The Affymetrix 500 K chip spaces SNPs along the genome and the Illumina chip uses the tagSNP design. LD becomes less apparent in the Illumina data set and we did not find any association masked by unfaithfulness. This result suggests that it is very difficult to detect these associations by using the tagSNP design. If more SNPs could be genotyped in the future GWAS, we would detect more unknown associations.

Conclusion

The phenomenon named "unfaithfulness" has been discussed as a mathematical concept in the literature. In this work, we answered the question whether there exist associations masked by unfaithfulness in genome-wide association studies. We developed a simple and fast method to examine all SNP pairs and demonstrated that our method is applicable to analyze genome-wide SNP data sets. We conducted experiments on both simulated data and seven real data sets from WTCCC and identify many associations masked by unfaithfulness. As expected, these identified associations only occur in local area. This implies that only the local search is needed to find such associations.

To date, we can only connect some identified associations to publicly available results from other association studies. As independent data set is limited as this moment, we have difficulties to replicate these findings. The biological interpretation of many findings remains unclear. It would be of great interest if their biological functions could be investigated. In addition, we only handle the two-locus associations in the presence of unfaithfulness. Detecting such associations for three or more loci is still an open problem.

Methods

Given a data set with ℒ SNPs and n samples, we use X_lto denote the l-th SNP, l = 1,···, ℒ and Y to denote the class label (0 for control and 1 for case). SNPs are bi-allelic genetic markers. Capital letters (e.g. A, B,...) and lowercase letters (e.g. a, b,...) are often used to denote major and minor alleles, respectively. For simplicity, we use {0, 1, 2} to represent the three genotypes {AA, Aa, aa}, respectively.

Definition of the association masked by unfaithfulness

Considering a pair of loci X_pand X_q, four logistic regression models are typically involved to test associations masked by unfaithfulness:

(3)

(4)

(5)

and

(6)

where I(V = v) is the indicator function that takes the value 1 if V = v is true and 0 otherwise. In order to make the representation of both logistic regression models and log-linear models (introduced later) in a compact and consistent way, we use the notation adopted in [37] and rewrite the above logistic regression models in the following forms:

(7)

(8)

(9)

and

(10)

Please note that the superscripts X_pand X_qin Equation (8), Equation (9) and Equation (10) are merely the labels and do not represent the exponents. The term Inline graphic represents the coefficient of X_pat category i. Throughout this paper, we use ℳ to denote logistic regression models. We will use M to denote log-linear models. The log-likelihood function of a logistic model ℳ is denoted as L_ℳ, and its maximum likelihood estimation (MLE) is denoted as Inline graphic . The log-likelihood function of a log-linear model M is denoted as L_M, and its maximum likelihood estimation (MLE) is denoted as . For example, ℳ₁is a logistic regression model whose log-likelihood function and MLE are denoted by L_ℳ1and .

Our goal is to test if ℳ_1⊕2is significantly different from ℳ₀when both ℳ₁and ℳ₂are not. The likelihood ratio test is often used to conduct such tests. To test the difference between ℳ_1⊕2and ℳ₀, the following three steps are involved:

1. Fit a logistic regression model defined in Equation (10) and obtain the log-likelihood Inline graphic .

2. Compute the log-likelihood Inline graphic of the null logistic regression model defined in Equation (7).

3. Calculate P-value using the χ²test on the value 2 ( Inline graphic ) with degree of freedom df = 2.

Similarly, the test of difference between ℳ₁(or ℳ₂) and ℳ₀involves the following three steps:

1. Fit a logistic regression model defined in Equation (8) (or Equation (9)) to measure the main effect of X_p(or X_q) and obtain the log-likelihood Inline graphic (or ).

2. Compute the log-likelihood Inline graphic of the null logistic regression model defined in Equation (7).

3. Calculate P-value using the χ²test on the value 2 ( Inline graphic ) (or 2 ()) with degree of freedom df = 2.

Directly using regression methods for testing all pairs of SNPs in GWAS would be very time-consuming. Often the parallel computation was recommended [38]. Here, we propose to use log-linear models [37] instead of logistical regression models in GWAS. We show that this makes the likelihood ratio test computationally more efficient in genome-wide SNP data analysis. In the following, we briefly summarize the key components. The details are explained in the supplementary document (Additional file 1).

Likelihood ratio tests using log-linear models

Given two loci X_pand X_q, a contingency table of X_p, X_q, Y will be used to test the association masked by unfaithfulness between (X_p, X_q) and Y. The size of the contingency table is I × J × K, where I = 3, J = 3, K = 2. We use n_ijkto denote the observed count in the cell (i, j, k) in the contingency table (Table 4). Here n_ijkis considered as the realization of a random variable N_ijkassumed as Poisson-distributed in log-linear models.

Table 4.

The genotype counts in cases (Y = 1) and controls (Y = 2).

Y = 1	Xq = 1	Xq = 2	Xq = 3	Y = 2	Xq = 1	Xq = 2	Xq = 3

Xp = 1	n₁₁₁	n₁₂₁	n₁₃₁	Xp = 1	n₁₁₂	n₁₂₂	n₁₃₂
Xp = 2	n₂₁₁	n₂₂₁	n₂₃₁	Xp = 2	n₂₁₂	n₂₂₂	n₂₃₂
Xp = 3	n₃₁₁	n₃₂₁	n₃₃₁	Xp = 3	n₃₁₂	n₃₂₂	n₃₃₂

Open in a new tab

We use the dot convention to indicate summation over a subscript, e.g., Inline graphic is the number of observations with X_p= i. Similarly, we have and . We also have and . Throughout this paper, we use to denote the expectation of N_ijkunder log-linear model M, and use to denote the MLE of .

The equivalence between log-linear models and logistic models are given in Table 5 (see model definitions in the supplementary document (Additional file 1)). Here we construct our test statistics based on three log-linear models, which are the homogeneous association model corresponding to the logistic regression model ℳ_1⊕2, the partial independence model corresponding to the logistic regression model ℳ₁(or ℳ₂), and the block independence model corresponding to the null logistic regression model ℳ₀. Let Inline graphic , and be the log-likelihood of the homogeneous association model M_H, the partial independence model M_Pand the block independent model M_Bevaluated at their MLEs , and ,, respectively.

Table 5.

Equivalence between log-linear models and logistic models for a three-way table with binary response variable Y (M_B: Block independence model.

Log-linear model	Logistic model
	ℳ₀: β₀

Open in a new tab

M_P: Partial independence model. M_H: Homogeneous association model).

Based on the equivalence, the deviance Inline graphic of logistic regression models can be computed as

(11)

and the deviance Inline graphic (or ) can be computed as

(12)

In Equation (11) and Equation (12), Inline graphic and have the closed-form solutions (please check the supplementary document (Additional file 1) for the derivation):

(13)

and

(14)

Iterative Proportional Fitting (IPF) [37] can be used to quickly estimate Inline graphic . Specifically, initialize to be 1 for all i, j, k, then do IPF as follows:

(15)

The updating formulas may only be ill-defined if Inline graphic , , or , due to multi-collinearity. If this happens, we set , (m = 1, 2,...) to zero accordingly. Our experimental results show that this solution works well in practice (We have compared our results with the standard software R, in which the multi-collinearity problem is elegantly handled when fitting generalized linear models. It turns out that our results agree with the results given by R). Then the test statistics can be efficiently computed. As a result, we are able to test every pair of loci to search for associations masked by unfaithfulness in GWAS. Table 6 gives the running time of our method for data sets of different sizes.

Table 6.

Running time of the proposed method for data sets of different sizes.

Data size	Running time
n = 5, 000, ℒ = 1, 000	3s
n = 5, 000, ℒ = 5, 000	76s
n = 5, 000, ℒ = 10, 000	303s

Open in a new tab

All timings are carried out on one desktop computer with 3.0 GHz CPU and 4G memory running Windows XP professional x64 Edition system. Here n is the number of samples and ℒ is the number of SNPs.

An exhaustive approach to detecting the two-locus associations masked by unfaithfulness in GWAS

This approach involves the following steps:

• Step 1. For all of ℒ SNP markers, we first filter out those SNPs with significant main effects using Equation (12) since we are only interested in those markers without significant main effects. The ℒ P-values can be adjusted by either the classic Benjamini-Hochberg method for controlling false discovery rate (FDR) or the Bonferroni correction for controlling family wise error rate (FWER).

• Step 2. For the remaining ℒ' SNPs without significant main effects, we check every pair using the Equation (11). Again, the P-values can be adjusted for controlling either FDR or FWER.

The P-value thresholds in both Step 1 and Step 2 are specified by users. The default setting of the threshold is 0.1. The multiple factor for Bonferroni correction is ℒ'(ℒ' - 1)/2, where ℒ' is the number of SNPs after removing those SNPs with significant marginal effects. Since the number of SNPs with significant marginal effects only accounts for a small fraction of the entire SNP set, we have ℒ'(ℒ' - 1)/2 ≈ ℒ(ℒ - 1)/2.

Simulation design

Let p(D|G_i) denote the probability of an individual being affected given its genotype G_i(i.e., the penetrance of G_i), and let Inline graphic denote the probability of an individual not being affected given its genotype G_i. Based on the definition of the odds of a disease

(16)

the penetrance p(D|G_i) of the genotype G_ican be calculated using

(17)

The disease prevalence p(D) and genetic heritability h²are given as

(18)

(19)

The odds table of our simulation model is given in Table 7. It is a multiplicative model of odds ratio, i.e., it is an additive model on the log-odds scale. The reason we choose this model is that we try to exclude interference of the interaction effect when we discuss the unfaithfulness. Essentially, the unfaithfulness arises due to the correlation cancelation. The interaction effects play no role here.

Table 7.

The odds table of the simulation model.

Model	BB	Bb	bb	Model	BB	Bb	bb

AA	α	αθ₂₁	αθ₂₁θ₂₂	AA	α	αθ_b
Aa	αθ₁₁	αθ₁₁θ₂₁	αθ₁₁θ₂₁θ₂₂	Aa	αθ_a	αθ_aθ_b
aa	αθ₁₁θ₁₂	αθ₁₁θ₁₂θ₂₁	αθ₁₁θ₁₂θ₂₁θ₂₂	aa

Open in a new tab

The parameters α, θ₁₁, θ₁₂, θ₂₁, and θ₂₂control the prevalence p(D) (Equation(18)) and the heritability h²(Equation(19)). For simplicity, let θ₁₁= θ₁₂= θ_a, θ₂₁= θ₂₂= θ_b

For simplicity, we restrict θ₁₁= θ₁₂= θ_aand θ₂₁= θ₂₂= θ_b. The parameter θ_a> 1 means that the minor allele "a" increases the disease risk. This corresponds to the bivariate regression coefficient β₁> 0. Similarly, θ_b< 1 implies β₂< 0. In the presence of linkage disequilibrium (linkage disequilibrium measure Δ > 0), unfaithfulness arises. To simulate this situation, we further set θ_a= θ and θ_b= 1/θ. In the simulation, the prevalence p(D) and the heritability h²are controlled by the parameters α and θ. We first specify the disease prevalence p(D) and genetic heritability h². Then we numerically solve the parameters (α and θ) based on the above equations. We set p(D) = 0.1 and h²= 0.02. The details are given in the supplementary document (Additional file 1).

Authors' contributions

CY and XW contributed equally to this work. They developed the method and drafted the manuscript together. NT, QY, HX and WY initialized this work. WY finalized the manuscript. All authors read and approved the final manuscript.

Supplementary Material

Additional file 1

In the supplementary document (Additional le 1), we present the details of simulation. We also give a brief introduction to log-linear models which are used in the main article. Finally, we provide full lists of the results identified from the WTCCC data sets.

Click here for file^{(308.6KB, PDF)}

Contributor Information

Can Yang, Email: eeyang@ust.hk.

Xiang Wan, Email: eexiangw@ust.hk.

Qiang Yang, Email: qyang@cse.ust.hk.

Hong Xue, Email: hxue@ust.hk.

Nelson LS Tang, Email: nelsontang@cuhk.edu.hk.

Weichuan Yu, Email: eeyu@ust.hk.

Acknowledgements

This work was partially supported with the Grant GRF621707 from the Hong Kong Research Grant Council, grant RPC06/07.EG09, RPC07/08.EG25, RPC10EG04 and a grant from Sir Michael and Lady Kadoorie Funded Research Into Cancer Genetics. We thank the two anonymous reviewers for their constructive comments, which greatly help us improve our manuscript.

References

Balding D. A tutorial on statistical methods for population association studies. Nature Reviews Genetics. 2006;7:781–791. doi: 10.1038/nrg1916. [DOI] [PubMed] [Google Scholar]
Cordell H. Detecting gene-gene interactions that underlie human diseases. Nature Reviews Genetics. 2009;10:392–404. doi: 10.1038/nrg2579. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ritchie M, Hahn L, Roodi N, Bailey L, Dupont W, Parl F, Moore J. Multifactor-dimensionality reduction reveals high-order interactions among estrogenmetabolism genes in sporadic breast cancer. The American Journal of Human Genetics. 2001;69:138–147. doi: 10.1086/321276. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schwarz D, Kónig I, Ziegler A. On Safari to Random Jungle: A fast implementation of Random Forests for high dimensional data. Bioinformatics. 2010. in press . [DOI] [PMC free article] [PubMed]
Zhang Y, Liu J. Bayesian inference of epistatic interactions in case-control studies. Nature Genetics. 2007;39:1167–1173. doi: 10.1038/ng2110. [DOI] [PubMed] [Google Scholar]
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, de Bakker P, Daly M, Sham P. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
Breiman L. Random Forests. Machine Learning. 2001;45:5–32. doi: 10.1023/A:1010933404324. [DOI] [Google Scholar]
Lunetta K, Hayward L, Eerdewegh PV. Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics. 2004;5:32–44. doi: 10.1186/1471-2156-5-32. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bureau A, Dupuis J, Falls K, Lunetta K, Hayward B, Keith T, Van Eerdewegh P. Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology. 2005;28(2):171–182. doi: 10.1002/gepi.20041. [DOI] [PubMed] [Google Scholar]
Wan X, Yang C, Yang Q, Xue H, Fan X, Tang N, Yu W. BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. The American Journal of Human Genetics. 2010;87(3):325–340. doi: 10.1016/j.ajhg.2010.07.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wasserman L, Roeder K. High-dimensional variable selection. The Annals of Statistics. 2009;37(5A):2178–2201. doi: 10.1214/08-AOS646. [DOI] [PMC free article] [PubMed] [Google Scholar]
Spirtes P, Glymour C, Scheines R. Causation, Prediction, and Search. MIT Press; 2001. [Google Scholar]
Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, series B. 1996;58:267–288. [Google Scholar]
Wu T, Chen Y, Hastie T, Sobel E, Lange K. Genomewide Association Analysis by Lasso Penalized Logistic Regression. Bioinformatics. 2009;25(6):714–721. doi: 10.1093/bioinformatics/btp041. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hoggart C, Whittatker J, Iorio M, Balding D. Simultaneous Analysis of All SNPs in Genome-wide and Re-Sequencing Association Studies. PLoS Genetics. 2008;4(7):e1000130. doi: 10.1371/journal.pgen.1000130. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yang C, Wan X, Yang Q, Xue H, Yu W. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group Lasso. BMC Bioinformatics. 2010;11(Suppl 1):S18. doi: 10.1186/1471-2105-11-S1-S18. [DOI] [PMC free article] [PubMed] [Google Scholar]
WTCCC. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911. [DOI] [PMC free article] [PubMed] [Google Scholar]
Guo Z, Hood L, Malkki M, Petersdorf E. Long-range multilocus haplotype phasing of the MHC. PNAS. 2006;103(18):6964–6969. doi: 10.1073/pnas.0602286103. [DOI] [PMC free article] [PubMed] [Google Scholar]
Xu C, Li P, Cooke R, Parikh S, Wang K, Kennedy J, Warsh J. TRPM2 variants and bipolar disorder risk: confirmation in a family-based association study. Bipolar Disorder. 2009;11:1–10. doi: 10.1111/j.1399-5618.2008.00655.x. [DOI] [PubMed] [Google Scholar]
Browning B, Browning S. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. The American Journal of Human Genetics. 2009;84(2):210–223. doi: 10.1016/j.ajhg.2009.01.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
Topol E, McCarthy J, Gabriel S, Moliterno D, Rogers W, Newby L, Freedman M, Metivier J, Cannata R, O'Donnell C, Kottke-Marchant K, Murugesan G, Plow E, Stenina O, Daley G. Single nucleotide polymorphisms in multiple novel thrombospondin genes may be associated with familial premature myocardial infarction. Circulation. 2001;104:2641–2644. doi: 10.1161/hc4701.100910. [DOI] [PubMed] [Google Scholar]
Zwicker J, Peyvandi F, Palla R, Lombardi R, Canciani M, Cairo A, Ardissino D, Bernardinelli L, Bauer K, Lawler J, Mannucci P. The thrombospondin-1 N700S polymorphism is associated with early myocardial infarction without altering von Willebrand factor multimer size. Blood. 2006;118(4):1280–1283. doi: 10.1182/blood-2006-04-015701. [DOI] [PMC free article] [PubMed] [Google Scholar]
McCarthy J, Meyer J, Moliterno D, Newby L, Rogers W, Topol E. Evidence for substantial effect modification by gender in a large-scale genetic association study of the metabolic syndrome among coronary heart disease patients. Human Genetics. 2003;114:87–98. doi: 10.1007/s00439-003-1026-1. [DOI] [PubMed] [Google Scholar]
Nejentsev S, Howson J, Walker N, Szeszko J. et al. Localization of type 1 diabetes susceptibility to the MHC class I genes HLA-B and HLA-A. Nature. 2007;450(6):887–892. doi: 10.1038/nature06406. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hodgkinson A, Millward B, Demaine A. The HLA-E locus is associated with age at onset and susceptibility to type 1 diabetes mellitus. Human Immunology. 2000;61(3):290–295. doi: 10.1016/S0198-8859(99)00116-0. [DOI] [PubMed] [Google Scholar]
Riper D, Jayakumar L, Latchana N, Bhoiwala D, Mitchell A, Valenti J, Crawford D. Regulation of vascular function by RCAN1 (ADAPT78) Archives of Biochemistry and Biophysics. 2008;472:43–50. doi: 10.1016/j.abb.2008.01.029. [DOI] [PubMed] [Google Scholar]
Duerr R, Taylor K, Brant S, Rioux J, Silverberg M, Daly M, Steinhart A, Abraham C, Regueiro M, Griffths A, Dassopoulos T, Bitton A, Yang H, Targan S, Datta L, Kistner E, Schumm L, Lee A, Gregersen P, Barmada M, Rotter J, DL N, Cho J. A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science. 2006;314:1461–1463. doi: 10.1126/science.1135245. [DOI] [PMC free article] [PubMed] [Google Scholar]
Amos C, Chen W, Seldin M, Remmers E, Taylor K, Criswell L, Lee A, Plenge R, Kastner D, Gregersen P. BMC proceedings. Vol. 3. BioMed Central Ltd; 2009. Data for Genetic Analysis Workshop 16 Problem 1, association analysis of rheumatoid arthritis data; p. S2. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Lv J. Sure independence screening for ultra-high-dimensional feature space. Journal of the American Statistical Association: Series B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
Moore J, White B. Tuning ReliefF for genomewide genetic analysis. Lecture Notes in Computer Science. 2007;4447:166–175. doi: 10.1007/978-3-540-71783-6_16. [DOI] [Google Scholar]
Wasserman L. All of statistics: a concise course in statistical inference. Springer Verlag; 2004. [Google Scholar]
Wang K, Dickson S, Stolle C, Krantz I, DB G, H H. Interpretation of association signals and identification of causal variants from genome-wide association studies. The American Journal of Human Genetics. 2010. [DOI] [PMC free article] [PubMed]
Hirschhorn J, Daly M. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics. 2005;6(2):95–108. doi: 10.1038/nrg1521. [DOI] [PubMed] [Google Scholar]
Schork N, Murray S, Frazer K, Topol E. Common vs. rare allele hypotheses for complex diseases. Current opinion in genetics & development. 2009;19(3):212–219. doi: 10.1016/j.gde.2009.04.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
Weale M, Depondt C, Macdonald S, Smith A, Lai P, Shorvon S, Wood N, Goldstein D. Selection and evaluation of tagging SNPs in the neuronal-sodium-channel gene SCN1A: implications for linkage-disequilibrium gene mapping. The American Journal of Human Genetics. 2003;73:551–565. doi: 10.1086/378098. [DOI] [PMC free article] [PubMed] [Google Scholar]
Carlson C, Eberle M, Rieder M, Yi Q, Kruglyak L, Nickerson D. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. The American Journal of Human Genetics. 2004;74:106–120. doi: 10.1086/381000. [DOI] [PMC free article] [PubMed] [Google Scholar]
Agresti A. Categorical Data Analysis. second. Wiley Series in Probability and Statistics, Wiley and Sons INC; 2002. [Google Scholar]
Ma L, Runesha H, Dvorkin D, Garbe J, Da Y. Parallel and serial computing tools for testing single-locus and epistatic SNP effects of quantitative traits in genome-wide association studies. BMC Bioinformatics. 2009;9:315. doi: 10.1186/1471-2105-9-315. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Additional file 1

Click here for file^{(308.6KB, PDF)}

[B1] Balding D. A tutorial on statistical methods for population association studies. Nature Reviews Genetics. 2006;7:781–791. doi: 10.1038/nrg1916. [DOI] [PubMed] [Google Scholar]

[B2] Cordell H. Detecting gene-gene interactions that underlie human diseases. Nature Reviews Genetics. 2009;10:392–404. doi: 10.1038/nrg2579. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B3] Ritchie M, Hahn L, Roodi N, Bailey L, Dupont W, Parl F, Moore J. Multifactor-dimensionality reduction reveals high-order interactions among estrogenmetabolism genes in sporadic breast cancer. The American Journal of Human Genetics. 2001;69:138–147. doi: 10.1086/321276. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B4] Schwarz D, Kónig I, Ziegler A. On Safari to Random Jungle: A fast implementation of Random Forests for high dimensional data. Bioinformatics. 2010. in press . [DOI] [PMC free article] [PubMed]

[B5] Zhang Y, Liu J. Bayesian inference of epistatic interactions in case-control studies. Nature Genetics. 2007;39:1167–1173. doi: 10.1038/ng2110. [DOI] [PubMed] [Google Scholar]

[B6] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, de Bakker P, Daly M, Sham P. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B7] Breiman L. Random Forests. Machine Learning. 2001;45:5–32. doi: 10.1023/A:1010933404324. [DOI] [Google Scholar]

[B8] Lunetta K, Hayward L, Eerdewegh PV. Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics. 2004;5:32–44. doi: 10.1186/1471-2156-5-32. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B9] Bureau A, Dupuis J, Falls K, Lunetta K, Hayward B, Keith T, Van Eerdewegh P. Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology. 2005;28(2):171–182. doi: 10.1002/gepi.20041. [DOI] [PubMed] [Google Scholar]

[B10] Wan X, Yang C, Yang Q, Xue H, Fan X, Tang N, Yu W. BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. The American Journal of Human Genetics. 2010;87(3):325–340. doi: 10.1016/j.ajhg.2010.07.021. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B11] Wasserman L, Roeder K. High-dimensional variable selection. The Annals of Statistics. 2009;37(5A):2178–2201. doi: 10.1214/08-AOS646. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B12] Spirtes P, Glymour C, Scheines R. Causation, Prediction, and Search. MIT Press; 2001. [Google Scholar]

[B13] Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, series B. 1996;58:267–288. [Google Scholar]

[B14] Wu T, Chen Y, Hastie T, Sobel E, Lange K. Genomewide Association Analysis by Lasso Penalized Logistic Regression. Bioinformatics. 2009;25(6):714–721. doi: 10.1093/bioinformatics/btp041. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B15] Hoggart C, Whittatker J, Iorio M, Balding D. Simultaneous Analysis of All SNPs in Genome-wide and Re-Sequencing Association Studies. PLoS Genetics. 2008;4(7):e1000130. doi: 10.1371/journal.pgen.1000130. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B16] Yang C, Wan X, Yang Q, Xue H, Yu W. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group Lasso. BMC Bioinformatics. 2010;11(Suppl 1):S18. doi: 10.1186/1471-2105-11-S1-S18. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B17] WTCCC. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B18] Guo Z, Hood L, Malkki M, Petersdorf E. Long-range multilocus haplotype phasing of the MHC. PNAS. 2006;103(18):6964–6969. doi: 10.1073/pnas.0602286103. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B19] Xu C, Li P, Cooke R, Parikh S, Wang K, Kennedy J, Warsh J. TRPM2 variants and bipolar disorder risk: confirmation in a family-based association study. Bipolar Disorder. 2009;11:1–10. doi: 10.1111/j.1399-5618.2008.00655.x. [DOI] [PubMed] [Google Scholar]

[B20] Browning B, Browning S. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. The American Journal of Human Genetics. 2009;84(2):210–223. doi: 10.1016/j.ajhg.2009.01.005. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B21] Topol E, McCarthy J, Gabriel S, Moliterno D, Rogers W, Newby L, Freedman M, Metivier J, Cannata R, O'Donnell C, Kottke-Marchant K, Murugesan G, Plow E, Stenina O, Daley G. Single nucleotide polymorphisms in multiple novel thrombospondin genes may be associated with familial premature myocardial infarction. Circulation. 2001;104:2641–2644. doi: 10.1161/hc4701.100910. [DOI] [PubMed] [Google Scholar]

[B22] Zwicker J, Peyvandi F, Palla R, Lombardi R, Canciani M, Cairo A, Ardissino D, Bernardinelli L, Bauer K, Lawler J, Mannucci P. The thrombospondin-1 N700S polymorphism is associated with early myocardial infarction without altering von Willebrand factor multimer size. Blood. 2006;118(4):1280–1283. doi: 10.1182/blood-2006-04-015701. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B23] McCarthy J, Meyer J, Moliterno D, Newby L, Rogers W, Topol E. Evidence for substantial effect modification by gender in a large-scale genetic association study of the metabolic syndrome among coronary heart disease patients. Human Genetics. 2003;114:87–98. doi: 10.1007/s00439-003-1026-1. [DOI] [PubMed] [Google Scholar]

[B24] Nejentsev S, Howson J, Walker N, Szeszko J. et al. Localization of type 1 diabetes susceptibility to the MHC class I genes HLA-B and HLA-A. Nature. 2007;450(6):887–892. doi: 10.1038/nature06406. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B25] Hodgkinson A, Millward B, Demaine A. The HLA-E locus is associated with age at onset and susceptibility to type 1 diabetes mellitus. Human Immunology. 2000;61(3):290–295. doi: 10.1016/S0198-8859(99)00116-0. [DOI] [PubMed] [Google Scholar]

[B26] Riper D, Jayakumar L, Latchana N, Bhoiwala D, Mitchell A, Valenti J, Crawford D. Regulation of vascular function by RCAN1 (ADAPT78) Archives of Biochemistry and Biophysics. 2008;472:43–50. doi: 10.1016/j.abb.2008.01.029. [DOI] [PubMed] [Google Scholar]

[B27] Duerr R, Taylor K, Brant S, Rioux J, Silverberg M, Daly M, Steinhart A, Abraham C, Regueiro M, Griffths A, Dassopoulos T, Bitton A, Yang H, Targan S, Datta L, Kistner E, Schumm L, Lee A, Gregersen P, Barmada M, Rotter J, DL N, Cho J. A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science. 2006;314:1461–1463. doi: 10.1126/science.1135245. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B28] Amos C, Chen W, Seldin M, Remmers E, Taylor K, Criswell L, Lee A, Plenge R, Kastner D, Gregersen P. BMC proceedings. Vol. 3. BioMed Central Ltd; 2009. Data for Genetic Analysis Workshop 16 Problem 1, association analysis of rheumatoid arthritis data; p. S2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B29] Fan J, Lv J. Sure independence screening for ultra-high-dimensional feature space. Journal of the American Statistical Association: Series B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B30] Moore J, White B. Tuning ReliefF for genomewide genetic analysis. Lecture Notes in Computer Science. 2007;4447:166–175. doi: 10.1007/978-3-540-71783-6_16. [DOI] [Google Scholar]

[B31] Wasserman L. All of statistics: a concise course in statistical inference. Springer Verlag; 2004. [Google Scholar]

[B32] Wang K, Dickson S, Stolle C, Krantz I, DB G, H H. Interpretation of association signals and identification of causal variants from genome-wide association studies. The American Journal of Human Genetics. 2010. [DOI] [PMC free article] [PubMed]

[B33] Hirschhorn J, Daly M. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics. 2005;6(2):95–108. doi: 10.1038/nrg1521. [DOI] [PubMed] [Google Scholar]

[B34] Schork N, Murray S, Frazer K, Topol E. Common vs. rare allele hypotheses for complex diseases. Current opinion in genetics & development. 2009;19(3):212–219. doi: 10.1016/j.gde.2009.04.010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B35] Weale M, Depondt C, Macdonald S, Smith A, Lai P, Shorvon S, Wood N, Goldstein D. Selection and evaluation of tagging SNPs in the neuronal-sodium-channel gene SCN1A: implications for linkage-disequilibrium gene mapping. The American Journal of Human Genetics. 2003;73:551–565. doi: 10.1086/378098. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B36] Carlson C, Eberle M, Rieder M, Yi Q, Kruglyak L, Nickerson D. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. The American Journal of Human Genetics. 2004;74:106–120. doi: 10.1086/381000. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B37] Agresti A. Categorical Data Analysis. second. Wiley Series in Probability and Statistics, Wiley and Sons INC; 2002. [Google Scholar]

[B38] Ma L, Runesha H, Dvorkin D, Garbe J, Da Y. Parallel and serial computing tools for testing single-locus and epistatic SNP effects of quantitative traits in genome-wide association studies. BMC Bioinformatics. 2009;9:315. doi: 10.1186/1471-2105-9-315. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

A hidden two-locus disease association pattern in genome-wide association studies

Can Yang

Xiang Wan

Qiang Yang

Hong Xue

Nelson LS Tang

Weichuan Yu

Abstract

Background

Results

Conclusions

Availability

Background

Figure 1.

Results

Experiment on simulation study

Figure 2.

Experiment on seven data sets from WTCCC

Table 1.

Table 2.

Table 3.

Bipolar disorder (BD)

Figure 3.

Figure 4.

Coronary artery disease (CAD)

Figure 5.

Type 1 diabetes (T1D)

Hypertension (HT)

Crohn's disease (CD), rheumatoid arthritis (RA) and type 2 diabetes (T2D)

Experiment on the Illumina data sets from other independent studies

Discussion

The unfaithfulness issue in the high dimensional feature space

The Relationship between interaction models and unfaithfulness

The Relationship between unfaithfulness and confounding

Biological interpretations

The unfaithfulness implications on tagSNPs

Conclusion

Methods

Definition of the association masked by unfaithfulness

Likelihood ratio tests using log-linear models

Table 4.

Table 5.

Table 6.

An exhaustive approach to detecting the two-locus associations masked by unfaithfulness in GWAS

Simulation design

Table 7.

Authors' contributions

Supplementary Material

Contributor Information

Acknowledgements

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases