Abstract
Objective:
The conventional Affected Sib Pair methods evaluate the linkage information at a locus by considering only marginal information. We describe a multilocus linkage method that uses both the marginal information and information derived from the possible interactions among several disease loci, thereby increasing the significance of loci with modest effects.
Methods:
Our method is based on a statistic that quantifies the linkage information contained in a set of markers. By a marker selection-reduction process, we screen a set of polymorphisms and select a few that seem linked to disease.
Results:
We test our approach on genome-scan data for inflammatory bowel disease (InfBD) and on simulated data. On real data we detect 6 of the 8 known InfBD loci; on simulated data we obtain improvements in power of up to 40% compared to a conventional single-locus method.
Conclusion:
Our extensive simulations and the results on real data show that our method is in general more powerful than single-locus methods in detecting disease loci responsible for complex traits. A further advantage of our approach is that it can be extended to make use of both the linkage and the linkage disequilibrium between disease loci and nearby markers.
Keywords: Genetic Screening, Affected Sib Pairs, Multilocus Linkage Method
1 Introduction
Traditional approaches to linkage analysis assign a score to each marker position by considering the linkage information given by that marker or a few nearby markers. These approaches have been applied very successfully to Mendelian diseases; however, they have been less fruitful in the context of complex diseases. Because complex genetic diseases are caused by the action of several genes that can interact in a complicated manner, methods that can exploit interactions among multiple disease loci are expected to be more powerful. Here we report a novel linkage method for affected sib pairs (ASPs). Our approach screens a large number of polymorphisms and selects a few that appear to be linked to disease genes. The selection relies on an importance score assigned to each marker, which combines marginal information with information coming from possible interactions among several disease loci.
Several multilocus linkage methods have been reported in the literature. These include model-based methods and model-free methods. The model-based methods calculate the full likelihood of disease and marker data under the assumed mode of inheritance (usually two-locus models, Schork et al. [1]). The model-free methods are based on comparing the observed allele-sharing among relatives that are phenotypically alike with that expected under no linkage. Their main characteristic is that they do not assume a specific mode of inheritance. Examples include Cordell et al. [2] and Farrall [3] for ASPs. Their methods are based on computing a maximum likelihood statistic (MLS) and are restricted to two-locus models. More recently, Cordell et al. [4] have presented a generalization of the MLS method in Cordell et al. [2] to several disease loci and affected relative pairs. Given linkage evidence at m − 1 loci, the evidence at the mth locus is measured by the difference in MLS between the best fitting m − 1 locus model and the best fitting m locus model. However, due to the sparseness of the data when m increases and the large number of parameters that need to be estimated in their model fitting procedure, the method is useful in practice only for the simultaneous analysis of at most ∼ 3 disease loci. Also these methods are applicable only after a primary genome-screen has already been performed, when the number of loci under investigation is small.
Here we report a new screening method. Our method works on datasets with a large number of markers and makes no assumption about the disease model, including the number of disease loci and their position in the genome. This new approach uses the interactions among several disease loci to help increase the importance of moderate effect disease loci relative to other noisy loci. The method is based on the repetition of a two-phase selection-reduction process. In the first step (“selection”) we select a small set of markers at random from the available list of polymorphisms. In the next step (“reduction”), we remove the unimportant markers from the current set one by one in a stepwise fashion until all the remaining markers are important or a single marker remains (we call these markers “returned”). At the end of this process we count how many times each marker was returned. Based on these counts we decide which markers are returned at significantly high frequency. The key technical aspect of this procedure is the definition of a statistic to measure the relative importance of a marker in the current set.
We apply this new approach to real data (Inflammatory Bowel Disease) as well as data simulated under several complex models. On the real data we confirm most of the known loci. On simulated data we show that our method is consistently more powerful than the single-locus methods currently in use.
The rest of the paper is organized as follows. In Section 2 (Methods) we illustrate the theoretical aspects of our approach. In Section 3 (Results) we present our findings on a real dataset for Inflammatory Bowel Disease and on simulated data. We conclude in Section 4 (Discussion) with a discussion of our findings.
2 Methods
2.1 Linkage Measure
The core of our approach is the definition of a linkage measure for a set of markers. In this section we describe this measure.
Notation 2.1.
Most model-free methods for ASPs work with the genotypic identical-by-descent (IBD) sharing at a locus, which can be 0, 1 or 2. In our approach we work with the allelic IBD status; in this case the IBD sharing can be 0 or 1, meaning the number of alleles a sibpair shares IBD transmitted from one of the parents. If the marker is not linked to disease, then the IBD sharing is 0 or 1 allele with equal probability 0.5. For several loci we define an IBD sharing vector, such that the ith component represents the sharing at the ith locus. For example, the IBD sharing vector 111 for three loci signifies that the sibpair shares 1 allele IBD at each of the three loci. Let $n_{111}$ denote the number of such sharing vectors in the dataset at three loci i, j, k.
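As a minimal illustration of this notation, the sketch below counts sharing vectors in a small made-up dataset. The function name `count_sharing_vectors` and the data layout (one row of 0/1 allelic IBD values per sib pair and transmitting parent) are assumptions chosen for illustration; they are not taken from the software described in this paper.

```python
from collections import Counter

def count_sharing_vectors(ibd_rows, loci):
    """Count how often each IBD sharing vector occurs at the chosen loci.

    ibd_rows: list of rows, each a sequence of 0/1 allelic IBD values
              (one row per sib pair and transmitting parent; assumed layout).
    loci:     indices of the loci to look at, e.g. (0, 1, 2).
    """
    counts = Counter()
    for row in ibd_rows:
        vector = "".join(str(row[j]) for j in loci)
        counts[vector] += 1
    return counts

# Made-up data: 4 rows observed at 3 loci.
rows = [(1, 1, 1), (1, 1, 1), (0, 0, 0), (1, 0, 1)]
counts = count_sharing_vectors(rows, (0, 1, 2))
n111, n000 = counts["111"], counts["000"]
```

Restricting `loci` to a subset, for example `(0, 2)`, gives the corresponding two-locus counts such as n11 and n00.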
Let S = {M1, M2, … , Mk} be a set of markers under evaluation. Then the measure is defined as:
| (1) |
where
| (2) |
The weight wk is chosen such that when none of the markers in the set S is linked to disease, we have:
The rationale for this is that when S contains only unlinked markers, the linkage measure should remain constant when any marker is removed from the set (no drop or increase in the linkage measure). We assume further that the k markers in S are not linked among themselves. Under these assumptions: .
Then we can write:
Since has a multinomial distribution with parameters N (twice the number of ASPs) and , and we have:
Considering this it is easy to see that:
hence the resulting weight wk.
It is revealing to rewrite the linkage measure as follows:
| (3) |
Then for k ≥ 4 we have wk ≈ 1 and wk ≈ wk−1. Hence we can write:
| (4) |
Essentially our measure is defined recursively as follows. We start with the natural NPL-like measure for one marker, $H_1 = 2(n_1 - n_0)^2$. The measure for k markers ($H_{1 \ldots k}$) is obtained as the average of the measures for all k possible subsets of k − 1 markers ($H_{2 \ldots k}$, $H_{13 \ldots k}$, … , $H_{1 \ldots k-1}$), plus an additional term that measures the interaction of all k markers together.
Notice that when none of the k markers is linked to disease we have: . Thus the interaction term tends to be small in this case (O(N)). However when all k markers are linked to disease, this term will become large (due to ).
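The order of magnitude of these quantities under no linkage can be checked directly. The short derivation below is a sketch under the assumptions already stated in this section: the k markers are unlinked to disease and among themselves, so each of the N sharing vectors is equally likely to be any of the $2^k$ binary patterns. The shorthand $p = 2^{-k}$ is introduced here for convenience.

```latex
% One marker: n_1 ~ Binomial(N, 1/2) and n_0 = N - n_1.
\begin{align*}
E[n_1 - n_0] &= 0, \qquad
\operatorname{Var}(n_1 - n_0) = \operatorname{Var}(2 n_1 - N) = 4 \cdot \tfrac{N}{4} = N,\\
E[H_1] &= 2\,E\big[(n_1 - n_0)^2\big] = 2N .
\end{align*}
% k markers: n_{1...1} and n_{0...0} are two cells of a multinomial,
% each with cell probability p = 2^{-k} and covariance -N p^2.
\begin{align*}
E\big[(n_{1\ldots1} - n_{0\ldots0})^2\big]
 &= \operatorname{Var}(n_{1\ldots1}) + \operatorname{Var}(n_{0\ldots0})
    - 2\operatorname{Cov}(n_{1\ldots1}, n_{0\ldots0})\\
 &= 2Np(1-p) + 2Np^2 = 2Np = \frac{N}{2^{k-1}} .
\end{align*}
```

Both expectations grow only linearly in N, in line with the O(N) behavior noted above, and $E[H_1] = 2N$ agrees with the value used for unlinked markers in the Appendix.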
Remark 2.2.
Our experiments show that under the assumption of no specific interaction model (e.g. epistasis or heterogeneity), the other possible pieces of information that we could use in the definition of the measure (e.g. n10, n01, etc.) may introduce noise (e.g. in the case of disease loci that interact epistatically). Certainly n10 and n01 contain information in a two-locus heterogeneity model, but the choice of a consistent statistic that would work for different scenarios forces us to disregard these terms and instead focus on n11 and n00. Notice that under both the epistatic and the heterogeneity interaction model for two disease loci E(n11 − n00) > 0, whereas when none of the loci is linked to disease E(n11 − n00) = 0.
2.2 Screening Algorithm
The screening procedure consists of a marker selection-reduction process described below. Suppose we have a list of many markers (hundreds in a whole-genome study). We proceed as follows:
Step 0 Repeat steps 1 – 4 B times, where B is fairly large (B ≥ 3000).
Step 1 Start by choosing a set of k ≈ 10 markers at random from the available list of markers.
Step 2 At each step compute, for each marker in the current set S, the resulting change in the linkage measure when that marker is removed. For marker i:
$$\Delta_i = H_{S \setminus \{i\}} - H_S$$
If Δi < 0, then the linkage measure decreases when removing marker i and therefore marker i is important relative to the other markers present in the current marker set. If Δi > 0, then the linkage measure increases when removing marker i and therefore marker i is not important relative to the other markers present.
Step 3 Remove the marker i (if any) with the largest positive Δi from the current set.
Step 4 Do Steps 2 – 3 until either all the markers in the current set are important (all Δi are negative) or only one marker remains. The returned markers are recorded.
Step 5 We compute for each marker a final return count denoting the total number of times it was returned in Step 4. Based on these counts we separate the markers into two classes: the important/linked to disease markers and the unimportant/unlinked ones. The details of this statistical procedure are given in Section 2.4.
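The steps above can be sketched as follows. This is a minimal sketch and not the authors' software: the function name `screen`, the `measure` callable that stands in for the linkage measure H of Section 2.1, the treatment of Δi = 0 as "important", and the tie-breaking rule are all assumptions made for illustration.

```python
import random
from collections import Counter

def screen(markers, measure, B=3000, k=10, seed=0):
    """Selection-reduction screening; returns the return count of each marker.

    markers: list of marker identifiers.
    measure: callable taking a list of markers and returning a number
             (placeholder for the linkage measure H; supplied by the caller).
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(B):                        # Step 0: repeat B times
        current = rng.sample(markers, k)      # Step 1: random subset of size k
        while len(current) > 1:               # Step 4: stop at a single marker
            h_full = measure(current)
            # Step 2: change in the measure when each marker is removed.
            deltas = {m: measure([x for x in current if x != m]) - h_full
                      for m in current}
            worst = max(current, key=deltas.get)
            if deltas[worst] <= 0:            # no marker has a positive delta
                break
            current.remove(worst)             # Step 3: drop largest positive delta
        counts.update(current)                # record the returned markers
    return counts                             # Step 5: final return counts

# Toy stand-in for H (NOT the paper's measure): markers 0 and 1 raise the
# score, every other marker lowers it.
weights = {m: (1 if m < 2 else -1) for m in range(20)}
toy_measure = lambda S: sum(weights[m] for m in S)
counts = screen(list(range(20)), toy_measure, B=1000, k=5, seed=1)
```

With the toy measure, markers 0 and 1 are returned every time they are drawn, whereas the remaining markers are returned only when a subset contains neither of them. This mimics the qualitative behavior described in Section 2.3 without reproducing the actual statistic.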
2.3 Why It Works
The behavior of the screening algorithm in Section 2.2 depends heavily on the properties of the statistic H1…k. We formulate these properties in the lemma below. The main idea is that in expectation only markers that are linked to disease are returned in Step 4 and markers that are not linked tend to be removed in Step 3. Let S = {1, … , k} be the current set. For the lemma below we make the simplifying assumption that the k markers are not linked among themselves.
Lemma 2.3.
The following properties are true:
- If none of the markers is linked to disease, then for any marker i in S we have
- If S contains one marker linked to disease (without loss of generality, assume this is the first marker) and the rest are unlinked, then for any unlinked marker u in S we have:
- If the set S has some interacting markers, linked to disease, of similar relative importance and some unlinked markers, then for any linked marker l and any unlinked one u we have:
- If the current set S contains only markers linked to disease that are of similar relative importance and also have non-negligible interaction, then for any marker l in S
Proof
See Appendix.
Remark 2.4.
We made the assumption that the k selected markers in Step 1 of the Screening Algorithm are unlinked among themselves. Given that in the majority of cases the k (≈ 10) markers chosen at random from a large number of markers are unlinked among themselves and also because of ease of computation, that assumption is reasonable. However, even when some of the markers in the current set are linked, the effect tends to be very small (computations not shown), and the screening algorithm behaves as desired.
To better illustrate these properties, we simulated a small dataset with 7 markers. The first two of these are each closely linked (θ = 0.01) to a different disease gene. The other five are unlinked to disease. The disease model is epistatic RR, i.e. two mutations at each of the two disease loci are necessary to have disease. As shown in Figure 1a, the measure H12347 decreases significantly when removing either one of the linked markers (1 or 2) and increases significantly when removing either of the unlinked markers (3, 4 or 7). In Figure 1b we see that when none of the markers in the current set is linked to disease, the values of the measure are small and not as well separated as the ones in Figure 1a. In fact a random (unlinked) marker is removed. In this example (Figure 1b), marker 3 is removed (i.e. Δ3 = H4567 − H34567 is the largest positive Δi).
Figure 1.
Linkage measure. Figure (a) and (b) illustrate the behavior of the measure for 5 markers when trying to remove each one of them in turn.
Note that we have:
Therefore, according to (4) in Section 2.1, the interaction term $(n_{11111} - n_{00000})^2$ is small in this case due to the presence of unlinked markers in the current set.
2.4 Important vs. Unimportant Markers
The goal of our method is to separate the important/linked to disease markers from the unimportant/unlinked markers. We present two different methods to achieve this goal. Both methods yield a good balance between false positive results and true positive results. In our experience the two methods behave similarly.
- Normal-Mixture Method
- We first fit a two-component normal-mixture model to the histogram of return counts:
$$f(x) = p_1\,\phi(x; \mu_1, \sigma_1^2) + p_2\,\phi(x; \mu_2, \sigma_2^2)$$
where φ(·; μ, σ²) denotes the normal density, μ2 > μ1 and p2 = 1 − p1; μ2 and μ1 are the means for the distribution of important and unimportant markers respectively. To control the false-positive rate (FPR) at a chosen level α, we select as threshold the (1 − α) quantile of the distribution for the unimportant markers. The markers that have a return count higher than this cutoff are claimed to be important (linked to disease genes).
- Efron's Method
- Another method to achieve this separation is based on an idea of Efron [5]. He proposes a method to divide the data values into two classes, interesting and uninteresting, when a large number of tests need to be evaluated as is the case in whole-genome scans. This is in contrast to the classical significant versus non-significant categorization used when the number of tests is small. The method first fits a natural spline to the histogram of return counts by Poisson regression. We call this curve f (mixture density). An empirical null distribution is also estimated, denoted by f0 (empirical null density). Then for each marker M with return count x_M the local false discovery rate is defined as:
$$\mathrm{locfdr}(M) = \frac{f_0(x_M)}{f(x_M)}$$
Controlling the false discovery rate suggests that the markers with locfdr < α be declared interesting (for a certain level α).
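For the normal-mixture method, the thresholding step can be illustrated with a small self-contained sketch. It fits the two-component mixture by a plain EM iteration and takes the (1 − α) quantile of the lower-mean component as the cutoff. The function names, the EM starting values, the number of iterations and the made-up return counts are all assumptions for illustration; the fitting procedure actually used in the paper may differ.

```python
import random
from statistics import NormalDist, mean, pstdev

def fit_two_normals(values, iters=200):
    """Fit p1*N(mu1, sd1) + p2*N(mu2, sd2) by plain EM (illustrative only)."""
    xs = sorted(values)
    half = len(xs) // 2
    # Crude start: lower and upper halves of the sorted counts.
    mu = [mean(xs[:half]), mean(xs[half:])]
    sd = [max(pstdev(xs[:half]), 1e-6), max(pstdev(xs[half:]), 1e-6)]
    p = [0.5, 0.5]
    for _ in range(iters):
        comps = [NormalDist(mu[j], sd[j]) for j in (0, 1)]
        # E-step: responsibility of component 0 for each value.
        resp0 = []
        for v in xs:
            d0, d1 = p[0] * comps[0].pdf(v), p[1] * comps[1].pdf(v)
            resp0.append(d0 / (d0 + d1) if d0 + d1 > 0 else 0.5)
        # M-step: weighted means, standard deviations and mixing weights.
        for j in (0, 1):
            r = resp0 if j == 0 else [1 - a for a in resp0]
            w = max(sum(r), 1e-12)
            mu[j] = sum(a * v for a, v in zip(r, xs)) / w
            var = sum(a * (v - mu[j]) ** 2 for a, v in zip(r, xs)) / w
            sd[j] = max(var ** 0.5, 1e-6)
            p[j] = w / len(xs)
    return p, mu, sd

def fpr_threshold(values, alpha=0.01):
    """(1 - alpha) quantile of the lower-mean ('unimportant') component."""
    p, mu, sd = fit_two_normals(values)
    lo = 0 if mu[0] <= mu[1] else 1
    return NormalDist(mu[lo], sd[lo]).inv_cdf(1 - alpha)

# Made-up return counts: many unimportant markers and a few important ones.
rng = random.Random(0)
counts = ([rng.gauss(140, 20) for _ in range(180)]
          + [rng.gauss(300, 20) for _ in range(20)])
threshold = fpr_threshold(counts, alpha=0.01)
```

Markers whose return count exceeds `threshold` would then be reported as important at the chosen level α.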
2.5 Choice of B and k
As explained in Section 2.2, our screening algorithm repeats B times the process of random selection of k markers and then evaluation of each of the markers in that set. We want to choose B and k large enough to obtain as clear a separation as possible between the markers linked to disease and the unlinked ones. We present a heuristic derivation of a formula for B in the Appendix. The formula predicts conservatively that for 200 markers B should be about 8000 and for 500 markers B ≈ 20000. The size of k influences the number of times certain markers are chosen together in the random subset. It should not be too small, since we want a good probability of selecting markers together. On the other hand, due to the sparseness of the data in large dimensions and also due to computational issues, k should not be too large. In our experience k = 10 works well.
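The effect of B and k on how often markers are drawn, alone and together, follows from simple counting for sampling without replacement. The sketch below is an illustrative calculation only; the function name `expected_selections` is hypothetical, and the formula for B derived in the Appendix is not reproduced here.

```python
def expected_selections(n, B, k):
    """Expected number of repetitions (out of B) in which a given marker,
    and a given pair of markers, appear in the random subset of size k
    drawn without replacement from n markers."""
    per_marker = B * k / n
    per_pair = B * k * (k - 1) / (n * (n - 1))
    return per_marker, per_pair

# The two settings quoted in the text, with k = 10.
m200, pair200 = expected_selections(n=200, B=8000, k=10)
m500, pair500 = expected_selections(n=500, B=20000, k=10)
```

In both quoted settings each marker is expected to be drawn 400 times, while a given pair of markers is expected to be drawn together about 18 times for 200 markers and about 7 times for 500 markers. A smaller k reduces the pair count quadratically, which is why k should not be too small.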
3 Results
We evaluated our method on both simulated data and real data.
3.1 Application to Simulated Data
We applied our method to two complex disease models.
3.1.1 First Simulated Disease Model
In the first disease model there are 9 unlinked disease loci. The disease is present when at least 5 of the 9 disease genes are mutated. The sample contains 200 ASPs genotyped at 50 markers, with 20% of the data sporadic (diseased because of nongenetic causes). Nine markers out of the total of 50 are linked to disease genes (θ = 0.05), one marker for each disease gene. The rest are independent markers, linked neither to disease nor to each other. The disease gene frequencies are all set to 0.05 and the marker frequencies are all 0.5. We assume we have complete data: the inference of the IBD sharing is without ambiguity. For each marker we compute two statistics:
- the single-locus statistic (the ASP mean test):
where n1 (n0) is the number of 1 (0) IBD sharing at that particular marker.
- the return count computed by the proposed method (B = 3000 and k = 10 in our screening procedure).
For each of the two methods we report the number of loci above certain significance thresholds: {1%, 2%, … , 10%} false positive rates. Since in the simulated data we know exactly which markers are linked to disease and which are unlinked, we can approximate the threshold corresponding to a specific false positive rate empirically by simulation.
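The comparison described above can be sketched in a few lines. The z-score form of the mean test used below is a common textbook form and is an assumption here, as are the function names and the rule used to turn a target false positive rate into an empirical threshold from the scores of markers known to be unlinked.

```python
import math

def asp_mean_test(n1, n0):
    """z-type mean test: excess of shared over non-shared alleles (assumed form)."""
    return (n1 - n0) / math.sqrt(n1 + n0)

def empirical_threshold(null_scores, fpr):
    """Score t such that at most a fraction `fpr` of the unlinked-marker
    scores are strictly greater than t."""
    ranked = sorted(null_scores, reverse=True)
    m = min(int(fpr * len(ranked) + 1e-9), len(ranked) - 1)
    return ranked[m]

def n_detected(linked_scores, null_scores, fpr):
    """Number of linked markers scoring above the empirical threshold."""
    t = empirical_threshold(null_scores, fpr)
    return sum(s > t for s in linked_scores)

# Made-up example: 60 shared versus 40 non-shared alleles at one marker.
z = asp_mean_test(60, 40)
```

The same `empirical_threshold` and `n_detected` helpers apply unchanged to return counts, so both methods can be compared at identical false positive rates.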
Figure 2 shows an example of a simulated dataset according to the complex model outlined above. The horizontal lines represent the thresholds for the 1%, 2%, 5% and 10% FPR. It illustrates the advantage of our method: because the markers linked to disease are returned together in Step 4 of the screening algorithm in Section 2.2, they separate better from the unlinked markers. Therefore the proposed method can be very powerful in increasing the importance of disease loci of moderate effect by making use of interactions among disease loci.
Figure 5.

Average percentage of disease loci discovered with the single-locus method and the modified multilocus method while controlling the FPR (a) and Sample Size Comparison (b) for the 4–locus disease model with LD. It illustrates the power of our method to take into account small levels of linkage disequilibrium between markers and disease loci.
To investigate the power, we generated 600 independent replicates. Figure 3a depicts the average percentage of disease loci selected by each of the two methods while keeping the false positive rate at the {1%, 2%, … , 10%} level. Our method is more powerful than the single-locus method at all levels. At the 1% significance level, our method discovers on average 3.1 of the 9 disease loci, while the single-locus method finds only 2.2 loci. Similarly at the 3% level we detect on average 4.5 loci, while the single-locus method finds 3.4 loci.
Figure 3.

Average percentage of disease loci discovered with the single-locus method and the new multilocus method (a) and Sample Size Comparison (b) for the 9–locus disease model. Figure (a) illustrates the fact that the new method outperforms the single-locus method at all significance levels. Figure (b) illustrates the increase in sample size necessary for the single-locus method to achieve similar power to that of the multilocus method.
Finally, we compared the increase in sample size necessary for the single-locus method to achieve similar power to that of the multilocus method. The results are depicted in Figure 3b. For this particular model an increase in sample size of over 20% is necessary.
3.1.2 Second Simulated Disease Model
We also simulated a similar disease model with 4 disease loci. Now the disease is present when at least 2 of the 4 disease genes are mutated.
Figure 4a depicts the average percentage of disease loci selected by each of the two methods while keeping the false positive rate at the {1%, 2%, … , 10%} level. As we can see, our method is more powerful than the single-locus method at all significance levels. In Figure 4b we illustrate the increase in sample size necessary for the single-locus method to attain similar power to that of the multilocus method. For this simpler disease model, a 10% – 15% increase is necessary.
Figure 4.

Average percentage of disease loci discovered with the single-locus method and the new multilocus method (a) and Sample Size Comparison (b) for the 4–locus disease model. Figure (a) illustrates the fact that the new method outperforms the single-locus method at all significance levels. Figure (b) illustrates the increase in sample size necessary for the single-locus method to achieve similar power to that of the multilocus method.
We then repeated the same simulations, but this time we introduced small linkage disequilibrium (LD) levels between some of the disease genes and the nearby linked markers. Namely, δ1 = δ2 = 0.5 and δ3 = δ4 = 0 where δ is the normalized LD measure. We compared the single-locus approach to a modified version of our multilocus linkage method (see Appendix) that can also take advantage of mild linkage disequilibrium between disease loci and nearby markers. In this case the results (Figure 5) are even better than the ones in Figure 4, where no linkage disequilibrium was present. The improvement at the 1% FPR is 23%, and an increase in sample size of over 25% is necessary for the single-locus linkage method to match the performance of the modified multilocus linkage method.
3.2 Application to Real Data (Inflammatory Bowel Disease)
We also analyzed a real dataset for Inflammatory Bowel Disease (InfBD) using our method. InfBD consists of two disorders: Crohn's Disease (CD) and Ulcerative colitis (UC). They are both inflammatory disorders of the gastrointestinal tract with a strong genetic contribution as revealed by epidemiological studies. Genome-wide searches for InfBD susceptibility loci have identified several regions of interest, showing that InfBD is a complex genetic disease caused by the action of several genes.
The present dataset is a genome screen of 106 ASPs (including parents) from Canada, affected with CD and genotyped at 457 microsatellite markers; the average marker spacing is ∼ 10 cM. These data have been previously analyzed in Rioux et al. [7] and in Lo and Zheng [8].
In order to apply our new approach to these data, we first inferred the IBD (identity-by-descent) sharing probabilities for each ASP using the program GENEHUNTER 2.0 (Daly et al. [9]). Since our method requires complete IBD sharing information, we probabilistically impute the IBD sharing values. More exactly, for each sib pair under study we generate the IBD value at each position in the sharing vector according to the corresponding sharing probabilities (as calculated by GENEHUNTER 2.0); for example if the sharing probabilities at a certain position are (0.2, 0.5, 0.3) for sharing 0, 1 and 2 alleles respectively, then we generate the IBD value 0, 1 or 2 according to this distribution. In order to minimize the bias due to these probabilistic imputations, we do it 100 times, each time generating a new dataset.
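The probabilistic imputation described above can be sketched as follows. The function names and the nested-list layout of the sharing probabilities are assumptions for illustration; in practice the probabilities would come from the GENEHUNTER output rather than from the made-up values used here.

```python
import random

def impute_ibd(sharing_probs, rng):
    """Draw one IBD value (0, 1 or 2) per sib pair and position.

    sharing_probs[pair][position] is a triple of probabilities for
    sharing 0, 1 and 2 alleles (assumed layout).
    """
    return [[rng.choices((0, 1, 2), weights=pos)[0] for pos in pair]
            for pair in sharing_probs]

def impute_datasets(sharing_probs, n_datasets=100, seed=0):
    """Generate several independently imputed datasets."""
    rng = random.Random(seed)
    return [impute_ibd(sharing_probs, rng) for _ in range(n_datasets)]

# Made-up probabilities for 2 sib pairs at 2 positions.
probs = [[(0.2, 0.5, 0.3), (0.0, 0.0, 1.0)],
         [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]]
datasets = impute_datasets(probs, n_datasets=100)
```

Positions with unambiguous sharing always receive the same value, while ambiguous positions such as (0.2, 0.5, 0.3) vary across the imputed datasets, which is why results are averaged over the 100 replicates.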
We applied our algorithm to each of the 100 generated datasets. We used B = 20000 and k = 10 in our screening procedure. The return counts (averaged over the 100 datasets) for all markers together with the fitted two-component normal mixture are depicted in Figure 6b. By controlling the false positive rate at a stringent level we obtain that markers with return count above 240 should be reported as important. In Figure 6a we depict the cumulative distribution function (CDF) for a single normal fitted to the data versus the CDF for a mixture of two normals. We also applied Efron's approach and by using a 1% cutoff for the local false discovery rate, we obtain similar results (see Appendix).
Figure 6.

Normal Mixture Approximation and Histogram of Return Counts. This figure depicts the histogram of the observed return counts for the InfBD data, together with a mixture of two normals fitted to the histogram. The normal mixture approximation represents one way to separate the important from the unimportant markers.
In Figure 7 we show the return counts plotted versus marker locations in the genome. The mean return count is 140 and is marked by a horizontal solid line. The threshold for declaring a marker important is 240 and is marked by a broken line.
Figure 7.

Results of the multilocus linkage method on the InfBD data. Return counts are plotted against marker positions in the genome. The broken line represents the threshold for the return counts necessary to control the false positive rate at a stringent level. Markers with return count above this threshold are considered important.
We validated 6 (IBD1, IBD3, IBD5, IBD6, IBD7, IBD8) of the 8 known InfBD loci. Additionally, we found several other interesting regions.
The region 1q21 contains a cluster of genes influencing epidermal differentiation. This region is linked to other inflammatory diseases, e.g. psoriasis; psoriasis can occur in association with Inflammatory Bowel Disease (Crohn's disease), suggesting that they may share common genetic risk factors.
The locus on chromosome 2p11 (D2S1790) is located ∼ 10 cM from the gene IL1R1 (interleukin 1 receptor, type 1). There is evidence for the activation of the mucosal immune system and the production of inflammatory cytokines, i.e. interleukin (IL)-1ra and IL-1beta, in Inflammatory Bowel Disease (Heresbach et al. [10]).
The region 2q32 harbors the STAT1 and STAT4 genes (signal transducers and activators of transcription), which are candidate genes for Inflammatory Bowel Disease (Barmada et al. [11]).
The locus on chromosome 3p: suggestive linkage in this region was found in Rioux et al. [7].
The locus on chromosome 7p13: gene IGFBP3 (insulin-like growth factor binding protein 3) maps to this region. Katsanos et al. [12] found that the serum IGFBP3 levels are reduced in patients with Inflammatory Bowel Disease.
The locus on chromosome 21q22.2 (D21S1809) is close to the TFF1 and TFF2 (trefoil factor 1 and 2) genes. These genes, located on 21q22.3, are expressed in the gastrointestinal mucosa. Increased levels of TFF1 and TFF2 have been found in serum from Inflammatory Bowel Disease patients (Vestergaard et al. [13]).
In the Appendix we give the table with all selected markers and their chromosomal position.
4 Discussion
We presented a new model-free multilocus linkage method for affected sib pairs. Our approach selects from a large number of polymorphisms a small number that appear to be linked to disease. No assumption is made on the disease model, including number of disease loci or their positions in the genome. It uses both the marginal linkage information as well as information coming from the possible interaction among several disease loci to boost the significance of modest single-locus effects. A further advantage of our method is that it can be naturally extended to take into consideration small linkage disequilibrium levels between disease loci and nearby markers, thereby gaining even greater increases in power over single locus linkage methods. Further details on this method will appear in a paper under preparation.
We evaluated our method on both simulated data and real data. Our extensive simulations consistently show that the proposed approach is more powerful than the conventional single-locus linkage methods at all significance levels (up to 40% increase in power). The improvement in power grows with the number of interacting disease loci. In the absence of interactions our method performs similarly to the single-locus methods. The results on the real data are also encouraging: we validated 6 of the 8 known InfBD loci and also found a few interesting loci, some of which have already been implicated in Inflammatory Bowel Disease pathogenesis.
Our method is also very general; the disease loci can be anywhere in the genome (possibly on different chromosomes) and they can interact in complex, unknown ways. We did make a simplifying assumption, namely we assumed that the selected markers in the current set are unlinked among themselves. It is clear however that the effect of linkage between two unimportant markers is superseded by the presence in the current set of a marker linked to disease. This point is best illustrated on real data, where we see that markers close together do not tend to have return counts higher than expected (e.g. chromosome 4 in Figure 7).
The software implementing the proposed methods is available from the authors upon request. A complete package will be available online soon.
Given the complex nature of the common diseases and the many challenges in genomewide scans, we believe that our approach is very relevant; by using both the marginal and the interaction information, our method performs better than the traditional single-locus methods. Also due to its generality, the proposed method is applicable to a large number of situations.
Figure 2.

Comparison between a simple single-locus method (a) and our new method (b) on a complex disease model with 9 disease loci. The figure illustrates how the multilocus approach can increase the significance of moderate effect loci.
Acknowledgments
We thank Tian Zheng for many useful discussions. We also thank Eric Lander and Mark Daly from the Whitehead Institute for Biomedical Research for providing the data on Inflammatory Bowel Disease. This research is partially supported by the National Institutes of Health (NIH) (grant RO1 GM070789 – 01).
Appendices
A Proof of Lemma 2.3
Let S = {1, … , k}. Assume the k markers in set S are unlinked among themselves.
Lemma 2.3.
The following properties are true:
- If none of the markers in the current set S is linked to disease, then for any marker i in S we have
- If the current set S contains one marker linked to disease (without loss of generality, assume this is the first marker) and the rest are unlinked, then for any unlinked marker u in S we have:
- If the set S has some interacting markers, linked to disease, of similar relative importance and some unlinked markers, then for any linked marker l and any unlinked one u we have:
- If the current set S contains only markers linked to disease that are of similar relative importance and also have non-negligible interaction, then for any marker l in S
Proof
1. The first part is easy; we have chosen the weights in Section 2.1 such that when no marker is linked to disease, we have:
2. From (3) in Section 2.1 we can write:
It suffices to show that
(we also use wk < wk−1). We prove the first inequality, namely . Since markers 2, … , k are not linked to disease and among themselves, one can easily show that E[H2…k] = 2N where N is twice the number of ASPs. Therefore we need to show that
Now we have:
| (5) |
where is the probability of the IBD sharing vector 1 … 1 at loci 1 … k. Therefore we showed the first inequality.
For the second inequality in , we proceed as follows. By definition in Section 2.1 we have:
| (6) |
It suffices to show that
| (7) |
Using (5) it is easy to prove that for any unlinked marker and for a linked marker. Also
where N is twice the number of ASPs. Hence what we need to prove is that:
If we let p1 = rp0 with r > 1 and since p1 + p0 = 1 we obtain:
The latter inequality is true for N (twice the number of ASPs) large enough. For example when and k = 2, a sample of 60 ASPs is sufficient. Hence we have shown that (7) is true. Now we can complete the proof of the second inequality in (*). If marker 1 is linked to disease and marker u is not linked, then we have:
Using the definitions of H2…k and H1…u−1 u+1…k in (6) and together with the inequalities above we obtain the second inequality in (*). This completes our proof.
3. We assume k ≥ 4 (k = 2 and k = 3 can be proved using case-by-case computations). We assume the first t markers are linked to disease and the rest (k − t) (> 0) are unlinked.
Let E[H1…i−1 i+1…k] = A for any i ≤ t and E[H1…i−1 i+1…k] = B for any i > t. Clearly, A < B. We now use the approximation in Section 2.1:
From this we have:
and since B > A we obtain:
Similarly:
We show that:
| (8) |
and therefore:
From (6) we can write:
| (9) |
With (8) and (9), we need to show:
| (10) |
Now:
where we used the fact that the last k − t markers are not linked to disease and among themselves (hence ). From this and (10) follows that we need to prove that:
1. If t is small compared with k (i.e. k − t is large) then since and it is sufficient to show that:
This inequality is similar to inequality (**) shown at point 2. of the lemma for the case t = 1. For N large enough and when t is small compared with k it is true.
2. If t is comparable to k and since k ≥ 4 (from our assumption), then tend to be much smaller than (say conservatively , ) and also
For example, if t = k − 1 and then the inequality above is
which is true (k ≤ 10). For t smaller than k − 1 the inequality is even sharper. This concludes our proof for t (the number of markers linked to disease) between 2 and k − 1. Next we show the case t = k.
4. We prove the case k ≥ 4. The cases k = 2 and k = 3 can be verified easily through direct case-by-case computations. We use the approximation in Section 2.1 and since the interacting disease loci have similar importance we obtain:
Since all k markers in the current set are assumed to be linked to disease and to interact together in a non-negligible fashion, is large enough to guarantee the inequality:
B Choice of B
A heuristic approach to estimating B was given in Lo and Zheng [6]. The derivation of B for the proposed method is very similar and we present the main ideas below.
Suppose M is a marker linked to disease in the original set of polymorphisms. Let p1 be the probability that a marker linked to disease is selected and returned in any single repetition (out of B) of the selection-reduction process; p0 is the same probability but for markers not linked to disease. Assume p1 = rp0 with r > 1. Let X be the observed return count for marker M. Then . In order to clearly separate the markers linked to disease from the ones not linked, we require: . After some algebra, this can be written equivalently as:
We can estimate p0, the probability for a marker not linked to disease to be selected and returned in a single repetition, as follows:
where n is the total number of markers and k is a small number of markers (say 10) selected to be evaluated; we assume conservatively that the probability that a selected marker, not linked to disease, is returned is .
Therefore we obtain
r can be written as:
where 1 − ε is an estimate for the probability that a linked marker, once selected, is returned. If conservatively we take r = 2 we obtain B ≈ 41(n − 1).
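The separation argument above can be illustrated numerically. The sketch below is not the authors' derivation: it assumes the B repetitions are independent, so that a marker's return count is binomial, and it takes p0 as a caller-supplied value because the expression for p0 is not reproduced here. The function names and the value p0 = 0.01 are hypothetical; only B ≈ 41(n − 1) for r = 2 comes from the text.

```python
import math


def repetitions_from_text(n_markers: int) -> int:
    """B ~ 41(n - 1), the value the text reports for r = 2."""
    return 41 * (n_markers - 1)


def return_count_summary(B: int, p0: float, r: float) -> dict:
    """Mean and SD of return counts for unlinked vs. linked markers.

    ASSUMPTION: return counts are Binomial(B, p0) for unlinked markers
    and Binomial(B, p1) with p1 = r * p0 for linked markers.
    """
    p1 = r * p0
    if not (0.0 < p0 < 1.0 and 0.0 < p1 < 1.0):
        raise ValueError("p0 and r * p0 must both lie strictly between 0 and 1")
    mean0, mean1 = B * p0, B * p1
    sd0 = math.sqrt(B * p0 * (1.0 - p0))
    sd1 = math.sqrt(B * p1 * (1.0 - p1))
    return {
        "mean_unlinked": mean0,
        "mean_linked": mean1,
        "sd_unlinked": sd0,
        "sd_linked": sd1,
        # Gap between the two means, in units of the summed standard deviations.
        "gap_in_sd": (mean1 - mean0) / (sd0 + sd1),
    }


# Hypothetical example: n = 101 markers, illustrative p0 = 0.01, r = 2.
B = repetitions_from_text(101)
summary = return_count_summary(B, p0=0.01, r=2.0)
print(B, summary)
```

A larger `gap_in_sd` means the return counts of linked and unlinked markers overlap less, which is the sense in which a sufficiently large B separates the two groups.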
C Efron's Approach to Separating the Important from the Unimportant Markers
We also applied Efron's method (Efron [5]) to separate the set of markers into two categories: important and unimportant. The figure below shows the results. The left panel depicts the histogram of the return counts together with the fitted empirical null density f0(z) and the mixture density f(z). In the right panel, the local false discovery rate (local fdr) plot is added to the histogram (scaled up by a factor of 50). A return count of 242 corresponds to a local fdr of 1%.
D Extension of the Multilocus Linkage Method
We give a natural extension of the multilocus linkage method so that mild linkage disequilibrium (LD) levels between disease loci and marker loci can be used in addition to linkage to obtain even greater increases in power over single-locus linkage methods. The extension is based on combining the multilocus linkage method with a similar association method (the BHTA algorithm, Lo and Zheng [6]). They are both based on the screening procedure in Section 2.2. In what follows we denote by ΔLi (called Δi in the main text) and ΔLDi (defined in Lo and Zheng [6]) the change in linkage and association information respectively when removing marker i from the current set.
Step 0 Repeat Steps 1–4 B times.
Step 1 Start by choosing a set of k ≈ 10 markers at random from the available list of markers.
Step 2 For each marker in the current set, compute the resulting change in both the linkage measure and the association measure when that marker is removed. For marker i these changes are ΔLi and ΔLDi.
Step 3 Among the markers i for which both ΔLi > 0 and ΔLDi > 0 (i.e. both the linkage measure and the association measure deem marker i unimportant), remove the one (if any) with the largest ΔLi + ΔLDi.
Step 4 Repeat Steps 2–3 until either all the markers in the current set are important (for each remaining marker i, ΔLi and ΔLDi are not both positive) or only one marker remains. We return marker i Ri times, depending on the linkage and the association evidence as follows:
where 1_{ΔLDi<0} is the indicator random variable for the event ΔLDi < 0; 1_{ΔLi<0} is defined similarly.
Step 5 We compute for each marker a final return count denoting the total number of times it was returned in Step 4. Based on these counts we separate the markers into two classes: the unimportant (unlinked) markers and the important (linked and/or associated) ones.
This simple procedure guarantees that when evaluating a marker we consider both the marginal information, as well as the interaction information contained in a dataset. Also it takes into account two pieces of information, usually treated separately: linkage information and linkage disequilibrium information. We are currently preparing a separate paper with the details of this combined procedure.
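The combined selection-reduction loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linkage and association measures are passed in as caller-supplied functions (`delta_l` and `delta_ld`, hypothetical names standing for ΔLi and ΔLDi), and the default return rule, Ri = 1_{ΔLi<0} + 1_{ΔLDi<0}, is an assumption, since the formula for Ri is not reproduced above.

```python
import random
from collections import Counter
from typing import Callable, Dict, Hashable, List, Sequence

# delta(marker, current_set) -> change in the measure when `marker` is removed.
DeltaFn = Callable[[Hashable, Sequence[Hashable]], float]


def assumed_return_rule(d_link: float, d_assoc: float) -> int:
    # ASSUMPTION: one return per measure that deems the marker important.
    return int(d_link < 0) + int(d_assoc < 0)


def reduce_once(markers, delta_l: DeltaFn, delta_ld: DeltaFn, k: int,
                rng: random.Random, return_rule) -> Dict[Hashable, int]:
    current: List[Hashable] = rng.sample(list(markers), k)           # Step 1
    while len(current) > 1:                                           # Step 4 loop
        deltas = {m: (delta_l(m, current), delta_ld(m, current))     # Step 2
                  for m in current}
        removable = [m for m in current
                     if deltas[m][0] > 0 and deltas[m][1] > 0]        # Step 3
        if not removable:
            break  # every remaining marker is important under at least one measure
        current.remove(max(removable, key=lambda m: sum(deltas[m])))
    # Return each surviving marker R_i times.
    return {m: return_rule(delta_l(m, current), delta_ld(m, current))
            for m in current}


def screen(markers, delta_l: DeltaFn, delta_ld: DeltaFn, B: int, k: int = 10,
           seed: int = 0, return_rule=assumed_return_rule) -> Counter:
    rng = random.Random(seed)
    counts = Counter({m: 0 for m in markers})
    for _ in range(B):                                                # Step 0
        returned = reduce_once(markers, delta_l, delta_ld, k, rng, return_rule)
        for m, r in returned.items():
            counts[m] += r                                            # Step 5
    return counts


# Toy measures (purely illustrative): markers 0 and 1 are always "important".
LINKED = {0, 1}


def toy_delta(marker, current):
    return -1.0 if marker in LINKED else 1.0


counts = screen(range(20), toy_delta, toy_delta, B=200, k=5, seed=1)
print(counts.most_common(3))
```

Passing the two measures and the return rule as arguments keeps the screening loop independent of how the linkage and association statistics are actually computed.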
E Results on Real Data
Table 1 lists the markers from the InfBD dataset that we claim are important, together with their chromosomal regions.
Table 1. Selected important markers
| chr | selected marker | region |
|---|---|---|
| 1 | D1S1612 | IBD7 |
| 1 | D1S534 | 1q21 |
| 1 | D1S1595 | 1q21 |
| 1 | D1S1677 | 1q23 |
| 2 | D2S1394 | 2p11 |
| 2 | D2S1790 | 2p11 |
| 2 | D1S1649 | 2q32 |
| 3 | D3S1285 | 3p |
| 4 | D4S2394 | 4q23-4q28 |
| 5 | D5S500 | IBD5 |
| 5 | 18 other markers from the region | IBD5 |
| 6 | DRB1 | IBD3 |
| 6 | DQB1 | IBD3 |
| 6 | D6S1017 | IBD3 |
| 7 | GATA31A10 | 7p13 |
| 12 | D12S372 | 12p |
| 16 | D16S2619 | IBD8 |
| 16 | D16S2753 | IBD1 |
| 19 | D19S591 | IBD6 |
| 19 | GATA21G05 | IBD6 |
| 19 | D19S714 | IBD6 |
| 21 | D21S1809 | 21q22.3 |
References
1. Schork NJ, Boehnke M, Terwilliger JD, Ott J. Two-Trait Locus Linkage Analysis: A Powerful Strategy for Mapping Complex Genetic Traits. Am J Hum Genet. 1993;53:1127–1136.
2. Cordell HJ, Todd JA, Bennett ST, Kawaguchi Y, Farrall M. Two-Locus Maximum LOD Score Analysis of a Multifactorial Trait: Joint Consideration of IDDM2 and IDDM4 with IDDM1 in Type 1 Diabetes. Am J Hum Genet. 1995;57:920–934.
3. Farrall M. Affected Sibpair Linkage Tests for Multiple Linked Susceptibility Genes. Genet Epidemiol. 1997;14:103–115. doi: 10.1002/(SICI)1098-2272(1997)14:2<103::AID-GEPI1>3.0.CO;2-8.
4. Cordell HJ, Wedig GC, Jacobs KB, Elston RC. Multilocus Linkage Tests Based on Affected Relative Pairs. Am J Hum Genet. 2000;66:1273–1286. doi: 10.1086/302847.
5. Efron B. Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis. J Am Stat Assoc. 2004;99:96–104.
6. Lo SH, Zheng T. Backward Haplotype Transmission Association (BHTA) Algorithm - A Fast Multiple Marker Screening Method. Hum Hered. 2002;53:197–215. doi: 10.1159/000066194.
7. Rioux JD, Silverberg MS, Daly MJ, Steinhart AH, McLeod RS, Griffiths AM, Green T, Brettin TS, Stone V, Bull SB, Bitton A, Williams CN, Greenberg GR, Cohen Z, Lander ES, Hudson TJ, Siminovitch KA. Genomewide Search in Canadian Families with Inflammatory Bowel Disease Reveals Two Novel Susceptibility Loci. Am J Hum Genet. 2000;66:1863–1870. doi: 10.1086/302913.
8. Lo SH, Zheng T. A Demonstration and Findings of a Statistical Approach through Re-analysis of Inflammatory Bowel Disease Data. Proc Natl Acad Sci USA. 2004;101:10386–10391. doi: 10.1073/pnas.0403662101.
9. Daly MJ, Kruglyak L, Pratt S, Houstis N, Reeve MP, Kirby A, Lander ES. GENEHUNTER 2.0 - a complete linkage analysis system. Am J Hum Genet Suppl. 1998;63:A286.
10. Heresbach D, Alizadeh M, Dabadie A, Le Berre N, Colombel JF, Yaouanq J, Bretagne JF, Semana G. Significance of interleukin-1beta and interleukin-1 receptor antagonist genetic polymorphism in inflammatory bowel diseases. Am J Gastroenterol. 1997;92(7):1164–1169.
11. Barmada MM, Brant SR, Nicolae DL, Achkar JP, Panhuysen CI, Bayless TM, Cho JH, Duerr RH. A Genome Scan in 260 Inflammatory Bowel Disease-Affected Relative Pairs. Inflamm Bowel Dis. 2004;10(1):15–22. doi: 10.1097/00054725-200401000-00002.
12. Katsanos KH, Tsatsoulis A, Christodoulou D, Challa A, Katsaraki A, Tsianos EV. Reduced serum insulin-like growth factor-1 (IGF-1) and IGF-binding protein-3 levels in adults with inflammatory bowel disease. Growth Horm IGF Res. 2001;11(6):364–367. doi: 10.1054/ghir.2001.0248.
13. Vestergaard EM, Brynskov J, Ejskjaer K, Clausen JT, Thim L, Nexo E, Poulsen SS. Immunoassays of human trefoil factors 1 and 2: measured on serum from patients with inflammatory bowel disease. Scand J Clin Invest. 2004;64(2):146–156. doi: 10.1080/00365510410001176.

