Abstract
The goal of genome-wide association (GWA) mapping in modern genetics is to identify genes or narrow regions in the genome that contribute to genetically complex phenotypes such as morphology or disease. Among the existing methods, tree-based association mapping methods have clear advantages over single marker-based and haplotype-based methods because they incorporate information about the evolutionary history of the genome into the analysis. However, existing tree-based methods either are designed primarily for binary phenotypes derived from case/control studies or fail to scale genome-wide.
In this paper, we introduce TreeQA, a quantitative GWA mapping algorithm. TreeQA utilizes local perfect phylogenies constructed in genomic regions exhibiting no evidence of historical recombination. Through careful algorithm design and implementation, TreeQA conducts quantitative genome-wide association analysis efficiently and is more effective than previous methods. We conducted extensive experiments on both simulated datasets and mouse inbred lines to demonstrate the efficiency and effectiveness of TreeQA.
1. Introduction
Genome-wide association (GWA) mapping locates genes or narrow regions in the genome that have statistically significant connections to phenotypes of interest. The discovery of these genes and regions offers the potential to increase understanding of the biological processes controlling the manifestation of phenotypes.
The most frequent genetic variants are single nucleotide polymorphisms (SNPs), in which a single nucleotide in the genome differs between individuals within a species. With the development of low-cost genotyping technologies, extensive SNP data can be cheaply and efficiently produced, which further increases the computational complexity of GWA mapping. Thus, there is an evident need for fast and effective GWA mapping methods.
Existing methods of association mapping look for similarities among samples (chromosomes, haplotypes, etc.) that are correlated with the phenotypes. If strong associations are present, the variance of the phenotype within groups of similar samples is substantially smaller than the variance over all samples.
For example, in single marker-based17,5 and haplotype-based association mapping10,4,12, samples are grouped according to their genetic variation at a single marker or a set of markers. For case/control phenotypes, markers that can divide samples into (almost) pure classes are reported. Though these methods employ different strategies for grouping samples, the derived groups are evaluated without further consideration of the intergroup similarities or alternate groupings.
To address this limitation, tree-based association methods14,18,13 utilize phylogenies constructed over the samples. A phylogeny tree is a rich yet compact representation of the genetic similarities among the samples. It provides sensible groupings of samples at multiple resolutions. However, the existing methods either handle only case/control phenotypes14,18 or do not scale to GWA mapping13.
In this paper, we introduce TreeQA, a tree-based quantitative GWA mapping algorithm. TreeQA utilizes local perfect phylogeny trees constructed in genomic regions exhibiting no evidence of historical recombination by the 4-gamete test2. Given a perfect phylogeny, TreeQA evaluates all implied groupings and finds the strongest associations to the phenotype. Furthermore, TreeQA can identify and remove outliers during association analysis.
A brute-force implementation consists of a double loop: for every phylogeny tree, and for every grouping represented by the tree, we conduct a separate ANOVA test to measure its association with the phenotype, and keep track of the best groupings and trees. This approach is inefficient and prone to multiple-testing errors1. Both the number of trees and the number of groupings per tree can be very large[a]. This large number of possible groupings requires many ANOVA tests, which is not only computationally expensive but also gives rise to spurious associations[b]. Thus, permutation tests are necessary to ensure the statistical significance of the discovered associations, which further increases the computational burden.
TreeQA exploits the following properties: (1) Groupings generated from the same tree obey a partial order, thus allowing reuse of intermediate computations; (2) A grouping may be derived from different trees, but only needs to be evaluated once; and (3) Different phenotype permutations may share a substantial number of common computations that need to be computed only once. Accordingly, TreeQA employs two prefix-tree structures21 to organize all observed sample subsets and groupings, to facilitate the caching and retrieval of reusable computations, and to guide the enumeration and evaluation of groupings. As a result, TreeQA handles quantitative GWA mapping very efficiently and is more effective and robust in association mapping than previous methods.
2. Related Work
Single-feature association mapping17,5 considers the sample groupings induced independently by each single marker. Statistical tests such as χ² and F-tests are used to measure the association between the phenotype and each grouping. These methods are computationally efficient; however, they do not utilize the additional information carried by haplotypes over single markers.
To address this shortcoming, haplotype-based methods have been developed. HAM15 considers combinations of three consecutive SNPs along the genome. QHPM8 uses frequent pattern mining methods to find haplotype patterns in the data, upon which sample groupings are created and evaluated. HapMiner10 clusters samples using consecutive subsets of markers, and then assesses the phenotype’s association strength.
The utility of local phylogenies in association mapping has been recently explored in TreeLD13, Blossoc14, and TreeDT18. These methods use trees to represent sample similarities. Their approach is to exhaustively examine all possible groupings implied by the given phylogenies without explicitly excluding any outliers. Both Blossoc and TreeDT assume simple categorical (binary) phenotypes. TreeLD handles quantitative phenotypes but is not scalable to GWA analysis.
Some other work6,7,16 uses a global phylogeny structure, e.g., ancestral recombination graph, over all markers in association mapping. However, because of the high computational cost of global phylogeny construction, these methods are not scalable to genome-wide analysis.
3. Preliminaries
We use a binary matrix H = S × M to represent a SNP dataset, where S = {s1, s2, …, sn} is the set of samples, and M = {m1, m2, …, mz} is the SNP marker set. Each sample is represented by a binary vector, in which ’0’ represents the majority alleles and ’1’ represents the minority alleles. We use f(si) to denote the phenotype value of a sample si and F(S′) to denote the phenotype values of samples in a subset S′. An example matrix H containing 10 samples and 10 SNP markers with phenotype is shown in Fig. 1(a).
Compatible region
A consecutive region of the genome is called a compatible region iff any pair of markers in that region are compatible by the 4-gamete test2. That is, among the 4 possible haplotypes formed by the two markers, at most three of them occur.
A compatible region is a genomic region exhibiting no evidence of historical recombination. In Fig. 1(a), the region from markers m1 to m8 is a compatible region. We use Cu,v to denote a compatible region from markers mu to mv.
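The 4-gamete test is simple to state in code. The following is a minimal Python sketch, not the authors' implementation; the function names `compatible` and `is_compatible_region` and the toy marker columns are illustrative assumptions.

```python
from itertools import combinations

def compatible(col_a, col_b):
    """4-gamete test: two binary markers are compatible if at most three
    of the four haplotypes 00, 01, 10, 11 occur across the samples."""
    return len(set(zip(col_a, col_b))) <= 3

def is_compatible_region(columns):
    """A region is compatible iff every pair of its markers is compatible."""
    return all(compatible(a, b) for a, b in combinations(columns, 2))

# Toy data (hypothetical, not the matrix of Fig. 1):
# one list per marker, one entry per sample.
m1 = [0, 0, 1, 1]
m2 = [0, 1, 1, 1]
m3 = [1, 0, 1, 0]

print(compatible(m1, m2))                  # haplotypes 00, 01, 11 only
print(compatible(m1, m3))                  # all four haplotypes occur
print(is_compatible_region([m1, m2, m3]))
```

Checking every pair costs time quadratic in the number of markers of the region, which is acceptable for short local regions.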
Maximal Compatible region
A compatible region is a maximal compatible region iff it cannot be extended on either side to include more SNPs while remaining compatible.
Perfect Phylogeny Tree
A phylogeny tree for a set of samples is perfect if the phylogeny avoids homoplasy. Every SNP is introduced by a mutation and is represented by an edge of the tree. Given a genomic region, a perfect phylogeny exists iff the region is a compatible region.
We use Tu,v to denote the perfect phylogeny tree of compatible region Cu,v. Given C1,8 in Fig. 1(a), its tree T1,8 is shown in Fig. 1(b). All samples are at the leaf nodes. Samples having identical haplotypes in the region share the same leaf node in the tree, e.g., s1 and s5. Each internal node represents a hypothetical common ancestor of a subset of samples. Each edge uniquely corresponds to a SNP (or a historical mutation). Interested readers may refer to paper3 for inferring perfect phylogenies from a set of SNPs.
Let E(Tu,v) = {e1, e2, …, ep} denote the set of edges in Tu,v. The removal of each edge partitions the samples into two subsets, denoted by S(0)(ei) and S(1)(ei). Given a tree Tu,v, we can generate 2·|E(Tu,v)| sample subsets by removing each edge separately. We denote this set of sample subsets by S(E)(Tu,v), where S(E)(Tu,v) = {S(j)(ei) | j ∈ {0, 1}, ei ∈ E(Tu,v)}.
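As an illustration, the edge-induced subsets can be read directly off the marker columns without building the tree. This is a sketch under an explicit assumption: in a perfect phylogeny every edge corresponds to a SNP, so the bipartition of an edge is taken to be the 0/1 split of its marker, and markers with identical splits are treated as one edge. The function name `edge_subsets` and the toy haplotypes are hypothetical.

```python
def edge_subsets(haplotypes):
    """Return the sample subsets implied by the edges of a perfect phylogeny.

    haplotypes maps a sample name to its 0/1 vector over the markers of one
    compatible region. Assumption: each edge's bipartition equals the 0/1
    split of a marker, and markers with identical splits share one edge.
    """
    samples = frozenset(haplotypes)
    n_markers = len(next(iter(haplotypes.values())))
    subsets = set()
    for j in range(n_markers):
        ones = frozenset(s for s in samples if haplotypes[s][j] == 1)
        zeros = samples - ones
        if ones and zeros:  # a monomorphic marker splits nothing
            subsets.update((zeros, ones))
    return subsets

# Toy region with three markers; the third repeats the split of the first.
toy = {"s1": (0, 0, 0), "s2": (0, 1, 0), "s3": (1, 1, 1), "s4": (1, 1, 1)}
subsets = edge_subsets(toy)
for subset in sorted(subsets, key=sorted):
    print(sorted(subset))
```

Here two distinct splits yield four subsets, matching the count of two subsets per edge.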
Definition 3.1
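A grouping of a sample subset S′, G(S′), is formed by a set of disjoint subsets of S′, G(S′) = {S′1, S′2, …, S′k}, where S′1 ∪ S′2 ∪ … ∪ S′k = S′ and S′i ∩ S′j = ∅ for i ≠ j. Given a tree Tu,v, we say a grouping G(S′) follows Tu,v iff S′i ∈ S(E)(Tu,v) for every S′i ∈ G(S′).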
For example, grouping G(S′) = {{s1, s5, s2, s3}, {s8, s9, s7, s10}} follows the tree in Fig. 1(b), while grouping G(S′) = {{s1, s2}, {s8, s4}} does not.
Definition 3.2
Given a sample subset S′, G1(S′) is called a parent-grouping of G2(S′) (and G2(S′) is called a child-grouping of G1(S′)) iff G1(S′) ≠ G2(S′) and every subset in G2(S′) is contained in some subset in G1(S′).
A child-grouping represents a finer partition of its parent-grouping on the same set of samples. For example, grouping {{s1, s5, s2, s3}, {s4, s6}} is the parent-grouping of {{s1, s5}, {s2, s3}, {s4, s6}}. We summarize the notations in Table 1.
Table 1. Summary of notations.
Notation | Description
S, si, S′ | the sample set, a sample, a subset of samples
M, mi | the marker set, a marker
H | a binary matrix representing the data
Cu,v | a compatible region of H
f(si) | phenotype value of sample si
F(S′) | the set of phenotype values of the samples in S′
Gi(S′) | a grouping of a sample subset S′
Tu,v | the perfect phylogeny tree of Cu,v
E(Tu,v) | the edge set of Tu,v
S(E)(Tu,v) | the set of sample subsets implied in tree Tu,v (leaf-sets)
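Definitions 3.1 and 3.2 can be checked mechanically. The sketch below assumes that a grouping follows a tree when all of its subsets are edge-induced subsets of that tree, and that a parent-grouping is a strictly coarser partition of the same samples; the helper names and the toy `tree_subsets` are hypothetical, while the parent/child pair is the example given in the text.

```python
def is_grouping(groups):
    """Subsets must be non-empty and pairwise disjoint."""
    total = sum(len(g) for g in groups)
    return all(groups) and total == len(frozenset().union(*groups))

def follows(groups, tree_subsets):
    """True if every subset of the grouping is an edge-induced subset."""
    return is_grouping(groups) and all(frozenset(g) in tree_subsets for g in groups)

def is_parent_grouping(parent, child):
    """True if child is a strictly finer partition of the same samples."""
    parent = {frozenset(g) for g in parent}
    child = {frozenset(g) for g in child}
    same_samples = frozenset().union(*parent) == frozenset().union(*child)
    refines = all(any(c <= p for p in parent) for c in child)
    return same_samples and refines and parent != child

# Parent/child example from the text.
g_parent = [{"s1", "s5", "s2", "s3"}, {"s4", "s6"}]
g_child = [{"s1", "s5"}, {"s2", "s3"}, {"s4", "s6"}]
print(is_parent_grouping(g_parent, g_child))

# Hypothetical edge-induced subsets of a small tree.
tree_subsets = {frozenset({"s1"}), frozenset({"s2", "s3"}),
                frozenset({"s1", "s2", "s3"}), frozenset({"s4"})}
print(follows([{"s1"}, {"s2", "s3"}], tree_subsets))
print(follows([{"s1", "s2"}, {"s4"}], tree_subsets))
```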
Association between a Compatible Region and a Phenotype
We use the one-way ANOVA test with permutations to measure the association between a grouping of samples and a quantitative phenotype. To accelerate the execution, we re-derive the formula of the ANOVA test.
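Given a grouping G(S′) = {S′1, S′2, …, S′k}, we calculate for each subset S′i
$$SQ(S'_i) = \sum_{s \in S'_i} f(s)^2 \tag{1}$$
$$SM(S'_i) = \sum_{s \in S'_i} f(s) \tag{2}$$
Combining all subsets together, we have $SQ(S') = \sum_{i=1}^{k} SQ(S'_i)$ and $SM(S') = \sum_{i=1}^{k} SM(S'_i)$, and the between-group sum of squares is
$$SSB(G(S')) = \sum_{i=1}^{k} \frac{SM(S'_i)^2}{|S'_i|} - \frac{SM(S')^2}{|S'|} \tag{3}$$
We obtain a base score for grouping G(S′)
$$F_0(G(S')) = \frac{SSB(G(S')) / (k - 1)}{\left( SQ(S') - SM(S')^2 / |S'| - SSB(G(S')) \right) / (|S'| - k)} \tag{4}$$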
A higher score indicates a stronger association between the grouping and the phenotype. Given the tree and the data in Fig. 1 and the following two groupings: , the scores are . Thus, grouping has a stronger association with the phenotype than grouping .
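To make the scoring concrete, here is a small sketch that computes the one-way ANOVA F statistic from per-subset summaries only. It assumes that SQ denotes the sum of squared phenotype values and SM the sum of phenotype values of a subset; the function names and the toy phenotype values are hypothetical and are not the data of Fig. 1.

```python
def sq_sm(values):
    """Return (sum of squares, sum, size) for one subset of phenotype values."""
    return sum(v * v for v in values), sum(values), len(values)

def f_score(stats):
    """One-way ANOVA F statistic from per-subset (SQ, SM, size) triples.

    Raises ZeroDivisionError if the within-group variation is zero or if
    there are as many groups as samples.
    """
    k = len(stats)
    n = sum(c for _, _, c in stats)
    sq_total = sum(q for q, _, _ in stats)
    sm_total = sum(m for _, m, _ in stats)
    group_term = sum(m * m / c for _, m, c in stats)
    ss_between = group_term - sm_total ** 2 / n
    ss_within = sq_total - group_term
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Two well-separated toy groups give a large score.
strong = f_score([sq_sm([1, 2, 3]), sq_sm([7, 8, 9])])
# The same values grouped without regard to magnitude give a small score.
weak = f_score([sq_sm([1, 8, 3]), sq_sm([7, 2, 9])])
print(strong, weak)
```

Because the score depends on each subset only through its three summary values, those values can be cached and combined, which is what makes the reuse described in Section 4 possible.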
To correct the multiple test errors, we apply a permutation test on G(S′) to calculate a significance score. To permute the phenotype, the phenotype values in F(S′) are randomly re-assigned to samples in S′. Then we calculate an F-score using the permuted phenotype following Eqs.1 to 4.
Assume that we conduct nPerm random permutations in total. For each permutation, we obtain a score Fj (j = 1, …, nPerm). Among the nPerm F scores, let p be the number of scores that are greater than or equal to the base score F0(G(S′)), i.e., p = |{Fj | Fj ≥ F0(G(S′)), j ∈ 1…nPerm}|. Then the significance score (P score) of G(S′) is
(5) |
A higher P score indicates that the association between grouping G(S′) and the phenotype is more significant.
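The permutation procedure can be sketched as follows. Eq. 5 is not reproduced here, so the sketch assumes P = 1 − p/nPerm, which is one definition consistent with "higher is more significant"; the actual formula may differ. The names `f_score` and `p_score`, the seed, and the toy values are illustrative.

```python
import random

def f_score(groups):
    """Textbook one-way ANOVA F statistic for a list of value lists."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    total_sq = sum(v * v for g in groups for v in g)
    total_sum = sum(v for g in groups for v in g)
    group_term = sum(sum(g) ** 2 / len(g) for g in groups)
    ss_between = group_term - total_sum ** 2 / n
    ss_within = total_sq - group_term
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def p_score(groups, n_perm=1000, seed=0):
    """Assumed P score: 1 - (number of permuted F >= base F) / n_perm."""
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    pooled = [v for g in groups for v in g]
    base = f_score(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly re-assign phenotype values to samples
        permuted, start = [], 0
        for size in sizes:
            permuted.append(pooled[start:start + size])
            start += size
        if f_score(permuted) >= base:
            hits += 1
    return 1.0 - hits / n_perm

strong = p_score([[1, 2, 3], [7, 8, 9]])
weak = p_score([[1, 8, 3], [7, 2, 9]])
print(strong, weak)
```

With only six samples the attainable P scores are coarse; real datasets have more samples and therefore finer resolution.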
Definition 3.3. The association between a compatible region and a phenotype
For a compatible region Cu,v, the highest P score achieved by any grouping following Tu,v is regarded as the P score of Cu,v. The P score represents the association between the compatible region and the phenotype,
$$P(C_{u,v}) = \max_{G(S') \text{ follows } T_{u,v}} P(G(S')) \tag{6}$$
Problem Definition
Given a SNP dataset and a quantitative phenotype, calculate the P score of every maximal compatible region and report the most significant ones.
4. TreeQA Algorithm
TreeQA takes two major steps: 1) identify maximal compatible regions in the genome and construct the perfect phylogenies of the regions; 2) compute the association between each compatible region and the phenotype.
4.1. Maximal Compatible Region and Phylogeny Construction
TreeQA scans the markers from left to right. To find the maximal compatible regions, it repeatedly extends the current region by adding the next marker until the new marker is incompatible with some marker in the region. When this happens, TreeQA maximizes the overlap between two consecutive regions: assume that the current compatible region is Cu,v, and marker mv+1 is incompatible with markers mi1, …, mik, u ≤ i1 < … < ik ≤ v; then TreeQA starts the next compatible region at the marker immediately following mik. For each maximal compatible region, TreeQA uses the perfect phylogeny inference algorithm3 to construct the local perfect phylogeny.
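The scan described above might look as follows in Python. This is a sketch of the left-to-right procedure, not the authors' code; `maximal_compatible_regions` is a hypothetical name, regions are returned as inclusive 0-based index pairs, and the pairwise check is the plain 4-gamete test without any optimization.

```python
def compatible(col_a, col_b):
    """4-gamete test on two binary marker columns."""
    return len(set(zip(col_a, col_b))) <= 3

def maximal_compatible_regions(columns):
    """Return maximal compatible regions as inclusive (u, v) index pairs.

    columns[j] is the 0/1 vector of marker j over all samples.
    """
    regions = []
    if not columns:
        return regions
    u = 0
    for nxt in range(1, len(columns)):
        # Markers of the current region that conflict with the next marker.
        conflicts = [i for i in range(u, nxt)
                     if not compatible(columns[i], columns[nxt])]
        if conflicts:
            regions.append((u, nxt - 1))   # current region cannot grow further
            u = conflicts[-1] + 1          # restart right after the last conflict
    regions.append((u, len(columns) - 1))
    return regions

# Toy markers: marker 2 conflicts with marker 0 only.
cols = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]
regions = maximal_compatible_regions(cols)
print(regions)
```

Restarting immediately after the last conflicting marker keeps every marker that is still compatible with the new one, which is how consecutive regions end up overlapping as much as possible.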
4.2. Association Computing
In the second step, TreeQA takes as input a quantitative phenotype and a set of local perfect phylogenies. It considers all possible groupings following the phylogenies and systematically explores the search space of these groupings in a carefully designed order such that intermediate computations can be maximally reused.
According to Definition 3.1, any grouping of a sample subset[c] that follows a tree Tu,v can be created from non-overlapping subsets in S(E)(Tu,v). By utilizing the lexicographical order[d] of subsets in S(E)(Tu,v), TreeQA can enumerate and evaluate all combinations of non-overlapping subsets systematically.
TreeQA enumerates all groupings via a depth-first recursive procedure. TreeQA extends the current grouping by including a new sample subset which does not overlap with any subsets in the current grouping. The association of each new grouping to the phenotype via a permutation test is computed. The P score of the corresponding maximal compatible region is updated accordingly. The enumeration continues recursively for each newly extended grouping.
Consider the tree in Figure 1. There are 14 sample subsets in S(E)(T1,8). Assume that the subsets have the following order,
TreeQA first generates a grouping containing se1 only. Among the remaining sample subsets, {se2, se3, se5, se7, se9, se12, se13} do not overlap with se1. In the next step, a grouping {se1, se2} is formed by adding se2 to the current grouping, and its P score is calculated. P(C1,8) is updated accordingly. Since all other sample subsets overlap with se1 or se2, no new grouping can be extended from {se1, se2}. Then, TreeQA examines the next grouping extended from {se1}, namely {se1, se3}, and all groupings extended from it. After examining all groupings containing se1, TreeQA starts from the grouping {se2} and extends it recursively to generate all groupings containing se2 but not se1. This process continues until all distinct groupings are enumerated.
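The depth-first enumeration can be sketched with a recursive generator. The sketch assumes the subsets are supplied in some fixed total order and that only groupings with at least two subsets are scored, since an ANOVA test needs two or more groups; the function name and the four toy subsets are hypothetical.

```python
def enumerate_groupings(subsets):
    """Yield each grouping of two or more pairwise-disjoint subsets once.

    subsets is a list of frozensets in a fixed total order; a grouping is
    only ever extended with subsets that come later in that order, so no
    grouping is produced twice.
    """
    def extend(current, used, start):
        for i in range(start, len(subsets)):
            if used.isdisjoint(subsets[i]):
                grouping = current + [subsets[i]]
                if len(grouping) >= 2:
                    yield grouping
                yield from extend(grouping, used | subsets[i], i + 1)

    yield from extend([], frozenset(), 0)

# Edge-induced subsets of a toy four-sample tree, in a fixed order.
se = [
    frozenset({"s1"}),
    frozenset({"s2", "s3", "s4"}),
    frozenset({"s1", "s2"}),
    frozenset({"s3", "s4"}),
]
found = list(enumerate_groupings(se))
for grouping in found:
    print([sorted(g) for g in grouping])
```

In a full implementation each yielded grouping would be scored by the permutation test and the region's P score updated.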
4.3. Effective Permutation
We found that more than 90% of the execution time of TreeQA is spent in permutation tests. Given a grouping G(S′), a permutation test is conducted in two steps: 1) randomly re-assigning the phenotype values in F(S′) to samples in S′; 2) calculating the corresponding F score by Eq. 4.
Given a subset S′, both steps take O(|S′|) time. TreeQA exploits maximal reusability of intermediate computation shared by permutation through the following two optimizations:
inTree: Common computation units shared by permutation tests of parent/child-groupings in a tree.
amgTree: Common computation units shared by permutation tests on groupings following multiple trees.
We use two global prefix-tree structures21, Treegrouping and Treesubset, to organize the groupings and sample subsets examined thus far, respectively, and thereby enable effective permutation tests.
4.3.1. inTree: Effective permutation tests within a tree
A pair of parent/child-groupings always involve the same set of samples. Let S′ denote a set of samples. For the permutation tests of the parent/child groupings of S′, instead of re-assigning the phenotype values in F(S′) independently for each grouping, they can share the same set of random permutations of F(S′).
For example, given the example in Fig. 1 and a pair of parent/child-groupings, G1(S′) = {{s1, s5, s2, s3}, {s8, s9, s7, s10}} and G2(S′) = {{s1, s5}, {s2, s3}, {s8, s9, s7, s10}}, their F0 scores are: F0(G1(S′)) = 9.79 and F0(G2(S′)) = 4.32. Assume that after a random permutation, the new phenotype values for the samples are: f(s1) = 85, f(s2) = 79, f(s3) = 109, f(s5) = 61, f(s7) = 86, f(s8) = 97, f(s9) = 78, f(s10) = 54. Using this new assignment, we can calculate the new F scores for both groupings: F (G1(S′)) = 0.12 and F (G2(S′)) = 0.7. By reusing the phenotype permutation between G1(S′) and G2(S′), we save O(|S′|) runtime in each permutation.
A child-grouping represents a finer partition of sample subsets in its parent-grouping. We say a grouping is at the finest level if it does not have any child-groupings. We use a global prefix-tree Treegrouping to index all groupings and maintain the parent/child relationship through auxiliary links (from a child-grouping to its parent-groupings). For each permutation of the phenotype, the F scores of a finest grouping and all of its parent-groupings are calculated together. We examine the finest grouping immediately followed by the examination of its parent groupings for maximum computation reuse. If a finest child-grouping has n parent-groupings, we save O(n|S′|) time in each permutation.
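The sharing of one permutation among a finest grouping and its parent-groupings can be illustrated as below. The sketch assumes the standard one-way ANOVA F statistic computed from per-subset sums of squares and sums; parent-groupings are described by lists of indices into the finest grouping, which is a representation chosen here for brevity. The groupings and the permuted phenotype values are those of the example in this section.

```python
import random

def f_from_stats(stats):
    """F score from (sum of squares, sum, size) triples, one per subset."""
    k = len(stats)
    n = sum(c for _, _, c in stats)
    sq = sum(q for q, _, _ in stats)
    sm = sum(m for _, m, _ in stats)
    term = sum(m * m / c for _, m, c in stats)
    return ((term - sm * sm / n) / (k - 1)) / ((sq - term) / (n - k))

def merge(stats, idx):
    """Summary of the union of the finest subsets listed in idx."""
    return tuple(sum(stats[i][j] for i in idx) for j in range(3))

def score_family(pheno, finest, parents):
    """Scores of the finest grouping and of each parent under one assignment."""
    stats = [(sum(pheno[s] ** 2 for s in g), sum(pheno[s] for s in g), len(g))
             for g in finest]
    scores = [f_from_stats(stats)]
    for parent in parents:
        scores.append(f_from_stats([merge(stats, idx) for idx in parent]))
    return scores

def permuted_scores(pheno, finest, parents, n_perm, seed=0):
    rng = random.Random(seed)
    samples = [s for g in finest for s in g]
    values = [pheno[s] for s in samples]
    out = []
    for _ in range(n_perm):
        rng.shuffle(values)  # one shuffle shared by all groupings
        out.append(score_family(dict(zip(samples, values)), finest, parents))
    return out

# Permuted phenotype values and groupings from the example above.
pheno = {"s1": 85, "s2": 79, "s3": 109, "s5": 61,
         "s7": 86, "s8": 97, "s9": 78, "s10": 54}
finest = [("s1", "s5"), ("s2", "s3"), ("s8", "s9", "s7", "s10")]   # G2
parents = [[[0, 1], [2]]]                                           # G1
f_g2, f_g1 = score_family(pheno, finest, parents)
print(round(f_g1, 2), round(f_g2, 2))
```

Each parent score is obtained from the summaries of the finest subsets without touching the individual phenotype values again.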
4.3.2. amgTree: Effective permutation among trees
The same grouping occurs repeatedly in different trees. We only need to compute its P score at its first occurrence. We use Treegrouping to store and retrieve the P score of all examined groupings. If the grouping formed by TreeQA can be found in Treegrouping, its P score is directly used. Otherwise, its P score is calculated and stored in Treegrouping.
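A minimal prefix-tree cache for P scores could look like this. It is a sketch only: the class `PrefixTree`, the helper `canonical_key`, and the placeholder score are hypothetical, and the key is simply the grouping's subsets in sorted order so that the same grouping reached from different trees maps to the same path.

```python
class PrefixTree:
    """Tiny prefix tree that stores one value per key sequence."""

    def __init__(self):
        self.children = {}
        self.value = None

    def lookup_or_compute(self, key, compute):
        """Return the stored value for key, computing it on first use."""
        node = self
        for item in key:
            node = node.children.setdefault(item, PrefixTree())
        if node.value is None:
            node.value = compute()
        return node.value

def canonical_key(grouping):
    """Order-independent key: sorted subsets, each with sorted samples."""
    return tuple(sorted(tuple(sorted(g)) for g in grouping))

calls = []

def expensive_p_score():
    calls.append(1)   # stands in for a full permutation test
    return 0.97       # placeholder value, not a real result

tree_grouping = PrefixTree()
from_tree_a = [{"s1", "s5"}, {"s2", "s3"}]
from_tree_b = [{"s3", "s2"}, {"s5", "s1"}]   # same grouping, found in another tree
p1 = tree_grouping.lookup_or_compute(canonical_key(from_tree_a), expensive_p_score)
p2 = tree_grouping.lookup_or_compute(canonical_key(from_tree_b), expensive_p_score)
print(p1, p2, len(calls))
```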
4.4. Reuse of Intermediate Computation of Statistical Tests
For any sample subset S′, SQ(S′) and SM(S′) calculated using the original phenotype values (with no permutation) may be reused in any grouping containing S′ and all its parent-groupings. We denote them by SQ0(S′) and SM0(S′) respectively in the following discussion.
We employ a global prefix-tree Treesubset to keep track of all sample subsets in any groupings examined thus far. Three values are stored at the leaf node corresponding to the subset S′: (subset ID, SQ0(S′), SM0(S′)).
For example, given the 10 samples and their phenotype values in Fig. 1(a), we calculate the base score F0 of grouping G1(S′) = {{s1, s5}, {s2, s3}, {s7, s10}}.
The SQ0 and SM0 values of the three subsets are then stored in Treesubset. Given a parent-grouping of G1(S′), G2(S′) = {{s1, s5, s2, s3}, {s7, s10}}, we can retrieve the values of SQ0 and SM0 and use them to calculate F0(G2(S′)),
The reuse of SQ0(S′) and SM0(S′) between parent/child groupings may work in conjunction with the inTree effective permutation. Besides, SQ0(S′) and SM0(S′) can also be reused by any groupings that contain the subset S′.
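The reuse of SQ0 and SM0 can be sketched with a small cache. The sketch assumes SQ0 is the sum of squared original phenotype values and SM0 their sum, so that the values of a merged subset are the sums of the values of its parts. A plain dictionary stands in for the prefix tree Treesubset, and the class name and phenotype values are hypothetical.

```python
class SubsetStats:
    """Cache of (SQ0, SM0) per sample subset, computed on the original phenotype."""

    def __init__(self, pheno):
        self.pheno = pheno
        self.cache = {}
        self.scans = 0   # number of times raw phenotype values were read

    def get(self, subset, parts=None):
        """Return (SQ0, SM0) of subset, reusing cached parts when possible."""
        key = frozenset(subset)
        if key not in self.cache:
            if parts and all(frozenset(p) in self.cache for p in parts):
                # Parent subset: add up the cached values of its parts.
                sq = sum(self.cache[frozenset(p)][0] for p in parts)
                sm = sum(self.cache[frozenset(p)][1] for p in parts)
            else:
                self.scans += 1
                sq = sum(self.pheno[s] ** 2 for s in key)
                sm = sum(self.pheno[s] for s in key)
            self.cache[key] = (sq, sm)
        return self.cache[key]

# Hypothetical phenotype values (not those of Fig. 1).
stats = SubsetStats({"s1": 10, "s5": 12, "s2": 20, "s3": 22})
a = stats.get({"s1", "s5"})
b = stats.get({"s2", "s3"})
merged = stats.get({"s1", "s5", "s2", "s3"}, parts=[{"s1", "s5"}, {"s2", "s3"}])
print(a, b, merged, stats.scans)
```

The merged subset is obtained with two additions instead of a scan over its samples, and a repeated request for any subset is a cache hit.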
5. Results
We compare TreeQA with the following algorithms: 1) SMA, our implementation of the Single Marker Association algorithm17,5; 2) HAM, our implementation of the Haplotype Association Mapping algorithm15 that slides a 3-SNP window through the genome; 3) HapMiner10, downloaded from its website[e]; and 4) TreeLD13, downloaded from its website[f]. Both SMA and HAM use the one-way ANOVA test for a fair comparison.
QHPM8 is not used for comparison because it is not scalable to large data sets. Blossoc14 and TreeDT18 are not used because they require categorical phenotypes.
5.1. Experiments on Simulated Data
We use Coasim11 to simulate 1000 sequences with scaled recombination rate ρ = 400 that corresponds roughly to 10 cM. 10,000 SNP markers are placed uniformly at random over the sequences.
SNP markers on the sequences are randomly selected as causative loci with one, two and three causative mutations. The first SNP is always selected randomly from all SNPs. In the cases of two and three mutations, the second and third causative SNPs are selected from compatible SNPs that are located less than 10 SNPs away from the first SNP. Phenotype values are sampled from four Gaussian distributions: N1(140, 35), N2(90, 35), N3(50, 40), and N4(10, 35). The one-mutation case uses N1 and N3. The two-mutation case uses N1, N2 and N3. The three-mutation case uses all four Gaussian distributions. After assigning the phenotype values, all causative SNPs are removed from the data and we randomly select 100 sequences for our experiments.
SMA, HAM and HapMiner output the top-scoring locus as a point estimate of the causative locus, while TreeQA outputs the top-scoring compatible region. We compare the effectiveness of the algorithms by measuring the distance (in cM) from the top-scoring locus, or from the center of the top-scoring region, to the causative SNP (or the average distance to all causative SNPs when there are several). We call this distance the prediction error.
Since HapMiner cannot finish processing 10,000 SNP markers in a reasonable time, we only use the first 1,000 markers of each sequence when applying HapMiner to the simulated data.
The comparison of SMA, HAM, HapMiner and TreeQA is shown in Figure 2. The x-axis represents the prediction error (distance) to the causative locus and the y-axis represents the percentage of causative loci that are found within a distance less than x. In all three cases, the loci estimated by TreeQA are closer to the causative loci than those estimated by SMA, HAM and HapMiner.
The TreeLD algorithm uses local phylogenies and analyzes quantitative phenotypes. However, TreeLD can only process a very small amount of data in a reasonable time. Therefore, we select 36 samples and 20 SNP markers from the simulated data for performance comparison. A one-mutation causative locus is selected from the 20 SNPs. For TreeQA, instead of generating maximal compatible regions as discussed in Sec. 4, a compatible region containing up to five SNPs is generated around each SNP. TreeLD takes about two hours to analyze this small dataset while TreeQA finishes in seconds. Figure 3 plots the results from TreeLD and TreeQA. The x-axis represents the simulated positions in the genome and the y-axis represents the scores of the SNPs. The vertical line marks the causative locus. TreeQA detects a peak near the causative locus while TreeLD identifies two spurious peaks.
5.2. Experiments on Mouse Genotype Data
We used a set of mouse genotypes that combines experimental and imputed data20 [g] from the Jackson Laboratory, consisting of 74 samples. The dataset contains over 7 million SNP markers distributed over all 20 chromosomes. We removed wild-derived mouse inbred strains, since they are quantitatively and qualitatively different from other laboratory inbred strains, and used in our experiments only the remaining 55 samples, which share a set of common ancestral relationships19.
We used high density lipoprotein cholesterol (HDL-C) levels in blood as the test phenotype, downloaded from the Mouse Phenome Database[h]. Several HDL-C datasets are available, each of which was collected under different conditions, and they are thus treated as separate phenotypes. Some candidate genes that may play a role in regulating HDL-C levels are reported in ref. 9.
We apply SMA, HAM and TreeQA to the data and examine how close the top peak reported by each method is to the loci of those candidate genes.
TreeQA detects top peaks near the locations for over 10 of the candidate genes9, including Ppara, Abcb4 and Rxrb. The top peaks reported by SMA and HAM are often far from the locations of these genes. Due to space limitation, we only show the results for one of them, Abcb4, in Figure 4.
The perfect phylogeny corresponding to the peak point (compatible region from 8799298 to 8801558 (base)) found by TreeQA is plotted in Fig. 5. The phenotype values of the samples are in parentheses. Samples with unknown phenotype values are omitted from the tree. The subtree on the right contains samples having high phenotype values while the subtree at the bottom contains samples having low values. Other subtrees are considered as outliers and are excluded from the grouping. SMA and HAM fail to identify the locus because they only examine sample groupings that can be generated from single SNPs or 3-SNP windows, which are a small subset of the groupings examined by TreeQA.
TreeQA takes about 10 minutes to analyze each chromosome which contains around 40000 SNPs on average. SMA and HAM take slightly less time than TreeQA. Both HapMiner and TreeLD are unable to finish in reasonable time.
6. Conclusion
In this paper, we present a tree-based quantitative GWA mapping algorithm, TreeQA. TreeQA utilizes local perfect phylogenies in detecting associations. Perfect phylogenies provide sensible groupings of samples at multiple resolutions. TreeQA explores the space of all possible groupings implied by the perfect phylogenies in a carefully designed order so that intermediate computations can be maximally reused. Our experimental results on both simulated and real data show that TreeQA can efficiently conduct quantitative GWA analysis and is more effective than the previous methods.
Footnotes
[a] For example, the number of trees can exceed tens of thousands in a chromosome-wide association study, and up to 2^(2n−2) groupings can be generated from a tree of n samples.
[b] With error rate ɛ, the risk of reporting at least one spurious association from x tests is 1 − (1 − ɛ)^x.
[c] Considering groupings of a sample subset allows TreeQA to exclude potential outliers from the ANOVA test.
[d] Any other way of defining a total order of the subsets would also work.
Contributor Information
Feng Pan, Email: panfeng@cs.unc.edu.
Leonard McMillan, Email: mcmillan@cs.unc.edu.
Fernando Pardo-Manuel de Villena, Email: fernando@med.unc.edu.
David Threadgill, Email: dwt@med.unc.edu.
Wei Wang, Email: weiwang@cs.unc.edu.
References
- 1. Miller RG. Simultaneous Statistical Inference. New York: Springer Verlag; 1981.
- 2. Hudson RR, Kaplan NL. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics. 1985;111(1):147–164. doi: 10.1093/genetics/111.1.147.
- 3. Agarwala R, Fernandez-Baca D, Slutzki G. Fast algorithms for inferring evolutionary trees. Journal of Computational Biology. 1995;2(3):397–408. doi: 10.1089/cmb.1995.2.397.
- 4. Toivonen H, Onkamo P, Vasko K, et al. Data mining applied to linkage disequilibrium mapping. Am J Hum Genet. 2000;67:133–145. doi: 10.1086/302954.
- 5. Akey J, Jin L, Xiong M. Haplotypes vs single marker linkage disequilibrium tests: what do we gain? Eur J Hum Genet. 2001;9(4):291–300. doi: 10.1038/sj.ejhg.5200619.
- 6. Larribe F, Lessard S, Schork NJ. Gene mapping via the ancestral recombination graph. Theoretical Population Biology. 2002;62(2):215–229. doi: 10.1006/tpbi.2002.1601.
- 7. Morris AP, Whittaker JC, Balding DJ. Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. Am J Hum Genet. 2002;70(3). doi: 10.1086/339271.
- 8. Onkamo P, Ollikainen V, Sevon P, et al. Association analysis for quantitative traits by data mining: QHPM. Ann Hum Genet. 2002;66:419–429. doi: 10.1017/S0003480002001318.
- 9. Wang X, Paigen B. Quantitative trait loci and candidate genes regulating HDL cholesterol. Arteriosclerosis, Thrombosis, and Vascular Biology. 2002;22:1390. doi: 10.1161/01.atv.0000030201.29121.a3.
- 10. Li J, Jiang T. Haplotype-based linkage disequilibrium mapping via direct data mining. Bioinformatics. 2005;21(24):4384–4393. doi: 10.1093/bioinformatics/bti732.
- 11. Mailund T, Schierup MH, et al. CoaSim: A flexible environment for simulating genetic data under coalescent model. BMC Bioinformatics. 2005;6(252). doi: 10.1186/1471-2105-6-252.
- 12. Waldron E, Whittaker J, Balding D. Fine mapping of disease genes via haplotype clustering. Genetic Epidemiology. 2005;30:170–179. doi: 10.1002/gepi.20134.
- 13. Zöllner S, Pritchard JK. Coalescent-based association mapping and fine mapping of complex trait loci. Genetics. 2005;169(2):1071–1092. doi: 10.1534/genetics.104.031799.
- 14. Mailund T, Besenbacher S, Schierup MH. Whole genome association mapping by incompatibilities and local perfect phylogenies. BMC Bioinformatics. 2006;7:454. doi: 10.1186/1471-2105-7-454.
- 15. McClurg P, Pletcher MT, Wiltshire T, Su AI. Comparative analysis of haplotype association mapping algorithms. BMC Bioinformatics. 2006;7:61. doi: 10.1186/1471-2105-7-61.
- 16. Minichiello MJ, Durbin R. Mapping trait loci by use of inferred ancestral recombination graphs. Am J Hum Genet. 2006;79(5):910–922. doi: 10.1086/508901.
- 17. Pe’er I, et al. Evaluating and improving power in whole-genome association studies using fixed marker sets. Nature Genetics. 2006;38:663–667. doi: 10.1038/ng1816.
- 18. Sevon P, Toivonen H, Ollikainen V. TreeDT: Tree pattern mining for gene mapping. IEEE Transactions on Computational Biology and Bioinformatics. 2006;3(2). doi: 10.1109/TCBB.2006.28.
- 19. Yang H, Bell TA, Churchill GA, de Villena FP-M. On the subspecific origin of the laboratory mouse. Nature Genetics. 2007;39. doi: 10.1038/ng2087.
- 20. Szatkiewicz JP, Beane GL, Ding Y, Hutchins L, de Villena FP-M, Churchill GA. An imputed genotype resource for the laboratory mouse. Mammalian Genome. 2008;19(3). doi: 10.1007/s00335-008-9098-9.
- 21. Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms.