TREEQA: QUANTITATIVE GENOME WIDE ASSOCIATION MAPPING USING LOCAL PERFECT PHYLOGENY TREES

Feng Pan; Leonard McMillan; Fernando Pardo-Manuel de Villena; David Threadgill; Wei Wang

Author manuscript; available in PMC: 2010 Jan 1.

Published in final edited form as: Pac Symp Biocomput. 2009:415–426.


PMCID: PMC2739990 NIHMSID: NIHMS132006 PMID: 19209719

Abstract

The goal of genome wide association (GWA) mapping in modern genetics is to identify genes or narrow regions in the genome that contribute to genetically complex phenotypes such as morphology or disease. Among the existing methods, tree-based association mapping methods show obvious advantages over single marker-based and haplotype-based methods because they incorporate information about the evolutionary history of the genome into the analysis. However, existing tree-based methods are designed primarily for binary phenotypes derived from case/control studies or fail to scale genome-wide.

In this paper, we introduce TreeQA, a quantitative GWA mapping algorithm. TreeQA utilizes local perfect phylogenies constructed in genomic regions exhibiting no evidence of historical recombination. Through careful algorithm design and implementation, TreeQA conducts quantitative genome-wide association analysis efficiently and is more effective than previous methods. We conducted extensive experiments on both simulated datasets and mouse inbred lines to demonstrate the efficiency and effectiveness of TreeQA.

1. Introduction

Genome wide association (GWA) mapping locates genes or narrows regions in the genome that have significant statistical connections to phenotypes of interest. The discovery of these genes and regions offers the potential to increase understanding of biological processes controlling manifestation of phenotypes.

The most frequent genetic variants are single nucleotide polymorphisms (SNPs), in which a single nucleotide in the genome differs between individuals within a species. With the development of low-cost genotyping technologies, extensive SNP data can be cheaply and efficiently produced, which further increases the computational complexity of GWA mapping. Thus, there is an evident need for fast and effective GWA mapping methods.

Existing methods of association mapping look for similarities among samples (chromosomes, haplotypes, etc.) that are correlated with the phenotypes. If strong associations are present, the variance of the phenotype within groups of similar samples is substantially smaller than the variance over all samples.

For example, in single marker-based¹⁷,⁵ and haplotype-based association mapping¹⁰,⁴,¹², samples are grouped according to their genetic variation at a single marker or a set of markers. For case/control phenotypes, markers that can divide samples into (almost) pure classes are reported. Though these methods employ different strategies for grouping samples, the derived groups are evaluated without further consideration of the intergroup similarities or alternate groupings.

Motivated by this observation, tree-based association methods¹⁴,¹⁸,¹³ utilize phylogenies constructed over the samples. A phylogeny tree is a rich yet compact representation of the genetic similarities among the samples. It provides sensible groupings of samples at multiple resolutions. However, the existing methods either handle only case/control phenotypes¹⁴,¹⁸ or do not scale to GWA mapping¹³.

In this paper, we introduce TreeQA, a tree-based quantitative GWA mapping algorithm. TreeQA utilizes local perfect phylogeny trees constructed in genomic regions exhibiting no evidence of historical recombination by the 4-gamete test². Given a perfect phylogeny, TreeQA evaluates all implied groupings and finds the strongest associations to the phenotype. Furthermore, TreeQA can identify and remove outliers during association analysis.

A brute-force implementation consists of a double loop: for every phylogeny tree, and for every grouping represented by the tree, we conduct a separate ANOVA test to measure its association to the phenotype, and keep track of the best groupings and trees. This approach is inefficient and prone to multiple test errors¹. Both the number of trees and number of groupings per tree can be very large^a. This large number of possible groupings requires many ANOVA tests, which is not only expensive computationally, but also gives rise to spurious associations^b. Thus, permutation tests are necessary to ensure the statistical significance of the discovered associations, which will further increase the computational burden.

TreeQA exploits the following properties: (1) groupings generated from the same tree obey a partial order, thus allowing reuse of intermediate computations; (2) a grouping may be derived from different trees, but needs to be evaluated only once; and (3) different phenotype permutations may share a substantial number of common computations that need to be computed only once. Thus, TreeQA employs two prefix-tree structures²¹ to organize all observed sample subsets and groupings, to facilitate the caching and retrieval of reusable computations, and to guide the enumeration and evaluation of groupings. As a result, TreeQA is able to handle quantitative GWA mapping very efficiently and is more effective and robust in association mapping than previous methods.

2. Related Work

Single-feature association mapping¹⁷,⁵ considers the sample groupings induced independently by each single marker. Statistical tests such as χ² and F-tests are used to measure the association between the phenotype and each grouping. These methods are computationally efficient; however, they do not utilize the additional information content carried by haplotypes over single markers.

To address this shortcoming, haplotype-based methods have been developed. HAM¹⁵ considers combinations of three consecutive SNPs along the genome. QHPM⁸ uses frequent pattern mining methods to find haplotype patterns in the data, upon which sample groupings are created and evaluated. HapMiner¹⁰ clusters samples using consecutive subsets of markers, and then assesses the phenotype’s association strength.

The utility of local phylogenies in association mapping has been recently explored in TreeLD¹³, Blossoc¹⁴, and TreeDT¹⁸. These methods use trees to represent sample similarities. Their approach is to exhaustively examine all possible groupings implied by the given phylogenies without explicitly excluding any outliers. Both Blossoc and TreeDT assume simple categorical (binary) phenotypes. TreeLD handles quantitative phenotypes but is not scalable to GWA analysis.

Some other work⁶,⁷,¹⁶ uses a global phylogeny structure, e.g., an ancestral recombination graph, over all markers in association mapping. However, because of the high computational cost of global phylogeny construction, these methods are not scalable to genome-wide analysis.

3. Preliminaries

We use a binary matrix H = S × M to represent a SNP dataset, where S = {s₁, s₂, …, s_n} is the set of samples, and M = {m₁, m₂, …, m_z} is the SNP marker set. Each sample is represented by a binary vector, in which ’0’ represents the majority alleles and ’1’ represents the minority alleles. We use f(s_i) to denote the phenotype value of a sample s_i and F(S′) to denote the phenotype values of samples in a subset S′. An example matrix H containing 10 samples and 10 SNP markers with phenotype is shown in Fig. 1(a).

Fig. 1. Example: a SNP dataset and a perfect phylogeny tree

Compatible region

A consecutive region of the genome is called a compatible region iff any pair of markers in that region are compatible by the 4-gamete test². That is, among the 4 possible haplotypes formed by the two markers, at most three of them occur.
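The pairwise check in this definition can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation; the function names and the toy matrix `H_toy` are hypothetical, and markers are indexed from 0.

```python
from itertools import combinations


def compatible(col_a, col_b):
    """Return True if two binary marker columns pass the 4-gamete test."""
    # Distinct two-marker haplotypes observed across the samples.
    gametes = set(zip(col_a, col_b))
    return len(gametes) <= 3


def is_compatible_region(H, u, v):
    """Check that every pair of markers in columns u..v (inclusive) is compatible."""
    cols = [[row[j] for row in H] for j in range(u, v + 1)]
    return all(compatible(a, b) for a, b in combinations(cols, 2))


# Hypothetical toy matrix: 4 samples (rows) x 3 markers (columns).
H_toy = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
]
```

In this toy matrix, markers 0 and 2 show all four haplotypes (00, 01, 10, 11), so the region covering markers 0 to 2 is not compatible, while the regions 0 to 1 and 1 to 2 are.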

A compatible region is a genomic region exhibiting no evidence of historical recombination. In Fig. 1(a), the region from markers m₁ to m₈ is a compatible region. We use C_u,v to denote a compatible region from markers m_u to m_v.

Maximal Compatible region

A compatible region is a maximal compatible region iff it cannot be extended on either side to include more SNPs while remaining compatible.

Perfect Phylogeny Tree

A phylogeny tree for a set of samples is perfect if the phylogeny avoids homoplasy. Every SNP is introduced by a mutation and is represented by an edge of the tree. Given a genomic region, a perfect phylogeny exists iff the region is a compatible region.

We use T_u,v to denote the perfect phylogeny tree of compatible region C_u,v. Given C_1,8 in Fig. 1(a), its tree T_1,8 is shown in Fig. 1(b). All samples are at the leaf nodes. Samples having identical haplotypes in the region share the same leaf node in the tree, e.g., s₁ and s₅. Each internal node represents a hypothetical common ancestor of a subset of samples. Each edge uniquely corresponds to a SNP (or a historical mutation). Interested readers may refer to Ref. 3 for an algorithm that infers perfect phylogenies from a set of SNPs.

Let E(T_u,v) = {e₁, e₂, …, e_p} denote the set of edges in T_u,v. The removal of each edge partitions the samples into two subsets denoted by S⁽⁰⁾(e_i) and S⁽¹⁾(e_i). Given a tree T_u,v, we can generate 2|E(T_u,v)| sample subsets by removing each edge separately. We denote this set of sample subsets by S^(E)(T_u,v) = {S^(j)(e_i) | j ∈ {0, 1}, e_i ∈ E(T_u,v)}.

Definition 3.1

A grouping of a sample subset S′, G(S′), is formed by a set of disjoint subsets of S′: $G(S') = \{S_{1}^{'}, S_{2}^{'}, \dots, S_{k}^{'}\}$, where $S_{i}^{'} \subset S'$, $S_{i}^{'} \cap S_{j}^{'} = \emptyset$ for $i \neq j$, and $\cup_{i=1}^{k} S_{i}^{'} = S'$. Given a tree T_u,v, we say a grouping G(S′) follows T_u,v iff $\forall S_{i}^{'} \in G(S'),\ S_{i}^{'} \in S^{(E)}(T_{u,v})$.

For example, grouping G(S′) = {{s₁, s₅, s₂, s₃}, {s₈, s₉, s₇, s₁₀}} follows the tree in Fig. 1(b), while grouping G(S′) = {{s₁, s₂}, {s₈, s₄}} does not.
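Definition 3.1 can be checked mechanically once S^(E)(T_u,v) is available. The sketch below builds the 14 subsets of the Fig. 1 tree from the seven subsets listed in Sec. 4.2 and their complements, with sample s_i written as the integer i; the helper names `edge_subsets` and `follows` are hypothetical.

```python
# Samples s1..s10 are represented by the integers 1..10.
S = frozenset(range(1, 11))

# One side of each edge cut of the Fig. 1 tree, as listed in Sec. 4.2.
clades = [{1, 5}, {2, 3}, {4, 6}, {8, 9}, {7, 10}, {1, 5, 2, 3}, {8, 9, 7, 10}]


def edge_subsets(samples, one_sides):
    """Build S^(E): each edge contributes one subset and its complement."""
    subsets = set()
    for side in one_sides:
        side = frozenset(side)
        subsets.add(side)
        subsets.add(samples - side)
    return subsets


def follows(grouping, subsets):
    """True if the blocks are pairwise disjoint and each one belongs to S^(E)."""
    blocks = [frozenset(b) for b in grouping]
    disjoint = sum(len(b) for b in blocks) == len(frozenset().union(*blocks))
    return disjoint and all(b in subsets for b in blocks)


SE = edge_subsets(S, clades)
```

With these definitions, `follows([{1, 5, 2, 3}, {8, 9, 7, 10}], SE)` holds, while `follows([{1, 2}, {8, 4}], SE)` does not, matching the two groupings discussed above.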

Definition 3.2

Given a sample subset S′, G₁(S′) is called a parent-grouping of G₂(S′) (and G₂(S′) a child-grouping of G₁(S′)) iff for every $S_{i}^{'} \in G_{1}(S')$, either $\exists S_{j}^{'} \in G_{2}(S')$ s.t. $S_{i}^{'} = S_{j}^{'}$, or $\exists \{S_{j_{q}}^{'} \mid S_{j_{q}}^{'} \in G_{2}(S'),\ q = 1, \dots, u\}$ s.t. $S_{i}^{'} = \cup_{q=1}^{u} S_{j_{q}}^{'}$.

A child-grouping represents a finer partition of its parent-grouping on the same set of samples. For example, grouping {{s₁, s₅, s₂, s₃}, {s₄, s₆}} is the parent-grouping of {{s₁, s₅}, {s₂, s₃}, {s₄, s₆}}. We summarize the notations in Table 1.
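The parent/child relation of Definition 3.2 can be tested directly on sets. This is an illustrative sketch only: the function name is hypothetical, samples are written as integers, and treating a grouping as not being its own parent is an assumption made here (the definition as literally stated is also satisfied by two identical groupings).

```python
def is_parent_grouping(g1, g2):
    """Return True if g1 is a parent-grouping of g2 (Definition 3.2)."""
    parent = {frozenset(b) for b in g1}
    child = {frozenset(b) for b in g2}
    # Assumption: identical groupings are not counted as parent/child.
    if parent == child:
        return False
    # Both groupings must cover the same sample subset S'.
    if frozenset().union(*parent) != frozenset().union(*child):
        return False
    for block in parent:
        # Child subsets fully contained in this parent subset.
        parts = [c for c in child if c <= block]
        # The parent subset must equal a child subset or a union of them.
        if frozenset().union(*parts) != block:
            return False
    return True


coarse = [{1, 5, 2, 3}, {4, 6}]
fine = [{1, 5}, {2, 3}, {4, 6}]
```

Here `is_parent_grouping(coarse, fine)` is true and `is_parent_grouping(fine, coarse)` is false, as in the example above.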

Table 1. Summary of Notations

| Notation | Meaning |
|---|---|
| S, s_i, $S_{i}^{'}$ | the sample set, a sample, a subset of samples |
| M, m_i | the marker set, a marker |
| H | a binary matrix representing the data |
| C_u,v | a compatible interval of H |
| f(s_i) | phenotype value of sample s_i |
| $F(S_{i}^{'})$ | the set of phenotype values of the samples in $S_{i}^{'}$ |
| G_i(S′) | a grouping of a sample subset S′ |
| T_u,v | the perfect phylogeny tree of C_u,v |
| E(T_u,v) | the edge set of T_u,v |
| S^(E)(T_u,v) | the set of sample subsets implied in tree T_u,v (leaf-sets) |


Association between a Compatible Region and a Phenotype

We use the one-way ANOVA test with permutations to measure the association between a grouping of samples and a quantitative phenotype. To accelerate the execution, we re-derive the formula of the ANOVA test.

Given a grouping $G(S') = \{S_{1}^{'}, \dots, S_{k}^{'}\}$, for every $S_{i}^{'} \in G(S')$ we calculate

$$SQ(S_{i}^{'}) = \sum_{s_{j} \in S_{i}^{'}} f(s_{j})^{2}, \qquad SM(S_{i}^{'}) = \sum_{s_{j} \in S_{i}^{'}} f(s_{j}) \tag{1}$$

$$SSE_{i} = SQ(S_{i}^{'}) - \frac{SM(S_{i}^{'})^{2}}{|S_{i}^{'}|}, \qquad SSB_{i} = \frac{SM(S_{i}^{'})^{2}}{|S_{i}^{'}|} \tag{2}$$

Combining all subsets together, we have $MM = \frac{1}{|S'|} \sum_{i=1}^{k} SM(S_{i}^{'})$ and

$$MSE = \frac{1}{|S'| - k} \sum_{i=1}^{k} SSE_{i}, \qquad MSB = \frac{1}{k - 1} \left( \sum_{i=1}^{k} SSB_{i} - |S'| \cdot MM^{2} \right) \tag{3}$$

We obtain a base score for grouping G(S′)

$$F_{0}(G(S')) = \frac{MSB}{MSE} \tag{4}$$

A higher score indicates a stronger association between the grouping and the phenotype. Given the tree and the data in Fig. 1 and the following two groupings, $G(S_{1}^{'}) = \{\{s_{2}, s_{3}\}, \{s_{4}, s_{6}\}, \{s_{8}, s_{9}\}\}$ and $G(S_{2}^{'}) = \{\{s_{2}, s_{3}\}, \{s_{8}, s_{9}\}\}$, the scores are $F_{0}(G(S_{1}^{'})) = 0.44$ and $F_{0}(G(S_{2}^{'})) = 4.16$. Thus, grouping $G(S_{2}^{'})$ has a stronger association with the phenotype than grouping $G(S_{1}^{'})$.

To correct for multiple testing errors, we apply a permutation test on G(S′) to calculate a significance score. To permute the phenotype, the phenotype values in F(S′) are randomly re-assigned to samples in S′. Then we calculate an F-score using the permuted phenotype following Eqs. 1 to 4.

Assume that we conduct nPerm random permutations in total; for each permutation, we get a score F_j (j = 1…nPerm). Among the nPerm F-scores, let p be the number of scores that are greater than or equal to the base score F₀(G(S′)), i.e., p = |{F_j | F_j ≥ F₀(G(S′)), j ∈ 1…nPerm}|. Then the significance score (P score) of G(S′) is

$$P(G(S')) = \log_{10} \left( \frac{nPerm}{p} \right) \tag{5}$$

A higher P score indicates that the association between grouping G(S′) and the phenotype is more significant.
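Eqs. 1 to 5 translate into a short routine. The sketch below is illustrative rather than the TreeQA implementation: the function names, the seed parameter, and the toy phenotype values are hypothetical. Eq. 5 is undefined when p = 0; returning infinity in that case is an assumption made only for this sketch.

```python
import math
import random


def f_score(groups):
    """F score of Eqs. 1-4; `groups` holds one list of phenotype values per subset.

    Requires at least two subsets and a non-zero within-group variance.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    sq = [sum(x * x for x in g) for g in groups]         # SQ(S_i'), Eq. 1
    sm = [sum(g) for g in groups]                        # SM(S_i'), Eq. 1
    ssb = [m * m / len(g) for m, g in zip(sm, groups)]   # SSB_i, Eq. 2
    sse = [q - b for q, b in zip(sq, ssb)]               # SSE_i, Eq. 2
    mm = sum(sm) / n                                     # MM
    mse = sum(sse) / (n - k)                             # MSE, Eq. 3
    msb = (sum(ssb) - n * mm * mm) / (k - 1)             # MSB, Eq. 3
    return msb / mse                                     # Eq. 4


def p_score(groups, n_perm=1000, seed=0):
    """P score of Eq. 5, estimated from n_perm random permutations."""
    sizes = [len(g) for g in groups]
    values = [x for g in groups for x in g]
    base = f_score(groups)
    rng = random.Random(seed)
    p = 0
    for _ in range(n_perm):
        rng.shuffle(values)  # re-assign phenotype values within S'
        permuted, start = [], 0
        for size in sizes:
            permuted.append(values[start:start + size])
            start += size
        if f_score(permuted) >= base:
            p += 1
    # Assumption: report infinity when no permutation reaches the base score.
    return math.inf if p == 0 else math.log10(n_perm / p)


# Hypothetical toy phenotype values for two subsets.
toy = [[1, 2, 3], [4, 5, 6]]
```

For the toy grouping the base score is 13.5 (MSB = 13.5, MSE = 1), and the P score is non-negative by construction because p never exceeds nPerm.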

Definition 3.3. The association between a compatible region and a phenotype

For a compatible region C_u,v, the highest P score achieved by any grouping following T_u,v is regarded as the P score of C_u,v. The P score represents the association between the compatible region and the phenotype,

$$P(C_{u,v}) = \max \{ P(G_{j}(S')) \mid \forall G_{j}(S') \text{ follows } T_{u,v},\ S' \subseteq S \} \tag{6}$$

Problem Definition

Given a SNP dataset and a quantitative phenotype, calculate the P score of every maximal compatible region and report the most significant ones.

4. TreeQA Algorithm

TreeQA takes two major steps: 1) identify maximal compatible regions in the genome and construct the perfect phylogenies of the regions; 2) compute the association between each compatible region and the phenotype.

4.1. Maximal Compatible Region and Phylogeny Construction

TreeQA scans the markers from left to right. To find the maximal compatible regions, it repeatedly extends the current region by adding the next marker until the new marker is incompatible with some marker in the region. It also maximizes the overlap between two consecutive regions: assume that the current compatible region is C_u,v and that marker m_{v+1} is incompatible with markers m_{i_1}, …, m_{i_k}, where u ≤ i_1 < … < i_k ≤ v; then TreeQA starts the next compatible region at marker m_{i_k+1}. For each maximal compatible region, TreeQA uses the inference algorithm³ to construct the local perfect phylogeny.
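The left-to-right scan can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names are hypothetical, markers are indexed from 0, and every extension step rechecks the new marker against all markers of the current region.

```python
def compatible(col_a, col_b):
    """4-gamete test on two binary marker columns."""
    return len(set(zip(col_a, col_b))) <= 3


def maximal_compatible_regions(H):
    """Return the maximal compatible regions of H as (first, last) marker indices."""
    n_markers = len(H[0])
    cols = [[row[j] for row in H] for j in range(n_markers)]
    regions = []
    u = 0  # start of the current region
    for nxt in range(1, n_markers):
        # Markers of the current region [u, nxt - 1] that conflict with marker nxt.
        conflicts = [i for i in range(u, nxt) if not compatible(cols[i], cols[nxt])]
        if conflicts:
            regions.append((u, nxt - 1))   # the region cannot be extended further
            u = conflicts[-1] + 1          # restart just after the last conflict
    regions.append((u, n_markers - 1))
    return regions


# Hypothetical toy matrix: 4 samples (rows) x 3 markers (columns).
H_toy = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
]
```

On the toy matrix, markers 0 and 2 are incompatible, so the scan reports the overlapping regions (0, 1) and (1, 2). Restarting just after the last conflicting marker is what keeps the overlap between consecutive regions as large as possible.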

4.2. Association Computing

In the second step, TreeQA takes as input a quantitative phenotype and a set of local perfect phylogenies. It considers all possible groupings following the phylogenies and systematically explores the search space of these groupings in a carefully designed order such that intermediate computations can be maximally reused.

According to Definition 3.1, any grouping of a sample subset^c that follows a tree T_u,v can be created from non-overlapping subsets in S^(E)(T_u,v). By utilizing the lexicographical order^d of subsets in S^(E)(T_u,v), TreeQA can enumerate and evaluate all combinations of non-overlapping subsets systematically.

TreeQA enumerates all groupings via a depth-first recursive procedure. It extends the current grouping by including a new sample subset that does not overlap with any subset already in the grouping. For each new grouping, its association with the phenotype is computed via a permutation test, and the P score of the corresponding maximal compatible region is updated accordingly. The enumeration continues recursively for each newly extended grouping.

Consider the tree in Figure 1. There are 14 sample subsets in S^(E)(T_1,8). Assume that the subsets have the following order,

$$\begin{array}{l} se_{1} = \{s_{1}, s_{5}\},\ se_{2} = S - se_{1},\ se_{3} = \{s_{2}, s_{3}\},\ se_{4} = S - se_{3},\ se_{5} = \{s_{4}, s_{6}\}, \\ se_{6} = S - se_{5},\ se_{7} = \{s_{8}, s_{9}\},\ se_{8} = S - se_{7},\ se_{9} = \{s_{7}, s_{10}\},\ se_{10} = S - se_{9}, \\ se_{11} = \{s_{1}, s_{5}, s_{2}, s_{3}\},\ se_{12} = S - se_{11},\ se_{13} = \{s_{8}, s_{9}, s_{7}, s_{10}\},\ se_{14} = S - se_{13} \end{array}$$

TreeQA first generates a grouping containing se₁ only. Among the remaining sample subsets, {se₂, se₃, se₅, se₇, se₉, se₁₂, se₁₃} do not overlap with se₁. In the next step, a grouping {se₁, se₂} is formed by adding se₂ to the current grouping and its P score is calculated. P(C_1,8) is updated accordingly. Since all other sample subsets overlap with se₁ or se₂, no new grouping can be extended from {se₁, se₂}. Then, TreeQA examines the next grouping extended from {se₁}, namely {se₁, se₃}, and all groupings extended from it. After examining all groupings containing se₁, TreeQA starts from the grouping {se₂} and extends it recursively to generate all groupings containing se₂ but not se₁. This process continues until all distinct groupings are enumerated.
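The depth-first enumeration described above can be sketched as a recursive generator. The function name is hypothetical, sample s_i is written as the integer i, and groupings are reported as tuples of 0-based positions in the ordered subset list (so position 0 is se₁); scoring is omitted.

```python
def enumerate_groupings(subsets):
    """Yield every set of pairwise non-overlapping subsets, in depth-first order.

    Each grouping is a tuple of indices into `subsets`, in increasing order.
    """
    def extend(current, used, start):
        for i in range(start, len(subsets)):
            if used.isdisjoint(subsets[i]):
                grouping = current + (i,)
                yield grouping
                # Only later subsets are tried, so no grouping is produced twice.
                yield from extend(grouping, used | subsets[i], i + 1)

    yield from extend((), frozenset(), 0)


# The 14 subsets of the Fig. 1 tree in the order se1, ..., se14 given above.
S = frozenset(range(1, 11))
one_sides = [{1, 5}, {2, 3}, {4, 6}, {8, 9}, {7, 10}, {1, 5, 2, 3}, {8, 9, 7, 10}]
ordered = []
for side in one_sides:
    ordered.append(frozenset(side))
    ordered.append(S - frozenset(side))

groupings = list(enumerate_groupings(ordered))
```

The first groupings produced are {se₁}, {se₁, se₂} and {se₁, se₃}, in the same order as the walk-through. A single-subset grouping such as {se₁} is enumerated as a starting point, but the ANOVA score of Eq. 4 needs at least two subsets.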

4.3. Effective Permutation

We found that more than 90% of the execution time of TreeQA is spent in permutation tests. Given a grouping G(S′), a permutation test is conducted in two steps: 1) randomly re-assigning the phenotype values in F(S′) to samples in S′; 2) calculating the corresponding F score by Eq. 4.

Given a subset S′, both steps take O(|S′|) time. TreeQA exploits maximal reusability of intermediate computation shared by permutation through the following two optimizations:

inTree: Common computation units shared by permutation tests of parent/child-groupings in a tree.
amgTree: Common computation units shared by permutation tests on groupings following multiple trees.

We use two global prefix-tree structures²¹, Tree_grouping and Tree_subset, to organize the groupings and sample subsets examined thus far, respectively, to enable effective permutation tests.

4.3.1. inTree: Effective permutation tests within a tree

A pair of parent/child-groupings always involve the same set of samples. Let S′ denote a set of samples. For the permutation tests of the parent/child groupings of S′, instead of re-assigning the phenotype values in F(S′) independently for each grouping, they can share the same set of random permutations of F(S′).

For example, given the example in Fig. 1 and a pair of parent/child-groupings, G₁(S′) = {{s₁, s₅, s₂, s₃}, {s₈, s₉, s₇, s₁₀}} and G₂(S′) = {{s₁, s₅}, {s₂, s₃}, {s₈, s₉, s₇, s₁₀}}, their F₀ scores are: F₀(G₁(S′)) = 9.79 and F₀(G₂(S′)) = 4.32. Assume that after a random permutation, the new phenotype values for the samples are: f(s₁) = 85, f(s₂) = 79, f(s₃) = 109, f(s₅) = 61, f(s₇) = 86, f(s₈) = 97, f(s₉) = 78, f(s₁₀) = 54. Using this new assignment, we can calculate the new F scores for both groupings: F (G₁(S′)) = 0.12 and F (G₂(S′)) = 0.7. By reusing the phenotype permutation between G₁(S′) and G₂(S′), we save O(|S′|) runtime in each permutation.

A child-grouping represents a finer partition of sample subsets in its parent-grouping. We say a grouping is at the finest level if it does not have any child-groupings. We use a global prefix-tree Tree_grouping to index all groupings and maintain the parent/child relationship through auxiliary links (from a child-grouping to its parent-groupings). For each permutation of the phenotype, the F scores of a finest grouping and all of its parent-groupings are calculated together. We examine the finest grouping immediately followed by the examination of its parent groupings for maximum computation reuse. If a finest child-grouping has n parent-groupings, we save O(n|S′|) time in each permutation.
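The sharing can be illustrated with the permuted assignment quoted in the example above. In this sketch, which is not the authors' implementation, the per-subset sums SQ and SM are computed once for the finest grouping and then added up for each parent-grouping, so a single permutation serves all of them. The function names and the index-tuple encoding of parent-groupings are hypothetical, and sample s_i is written as the integer i.

```python
import random


def f_from_sums(blocks):
    """F score (Eqs. 1-4) from per-subset (SQ, SM, size) triples."""
    k = len(blocks)
    n = sum(size for _, _, size in blocks)
    ssb = [sm * sm / size for _, sm, size in blocks]
    sse = sum(sq for sq, _, _ in blocks) - sum(ssb)
    mm = sum(sm for _, sm, _ in blocks) / n
    return ((sum(ssb) - n * mm * mm) / (k - 1)) / (sse / (n - k))


def shared_permutation(phenotype, child, rng):
    """Draw one random re-assignment of F(S') to be used by all related groupings."""
    samples = [s for block in child for s in block]
    values = [phenotype[s] for s in samples]
    rng.shuffle(values)
    return dict(zip(samples, values))


def scores_for_assignment(assignment, child, parents):
    """Score a finest grouping and its parent-groupings under one assignment.

    child: list of blocks (lists of samples); parents: list of parent-groupings,
    each given as a list of tuples of child-block indices to be merged.
    """
    sums = [(sum(assignment[s] ** 2 for s in b), sum(assignment[s] for s in b), len(b))
            for b in child]
    scores = [f_from_sums(sums)]
    for parent in parents:
        merged = [tuple(sum(sums[i][j] for i in idx) for j in range(3)) for idx in parent]
        scores.append(f_from_sums(merged))
    return scores


# Permuted phenotype values quoted in the example for Fig. 1.
permuted = {1: 85, 2: 79, 3: 109, 5: 61, 7: 86, 8: 97, 9: 78, 10: 54}
child = [[1, 5], [2, 3], [8, 9, 7, 10]]   # G2, the finer grouping
parents = [[(0, 1), (2,)]]                # G1 merges the first two blocks of G2
f_child, f_parent = scores_for_assignment(permuted, child, parents)
```

With these values the sketch gives F(G₂(S′)) ≈ 0.70 and F(G₁(S′)) ≈ 0.12, in agreement with the scores quoted in the example.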

4.3.2. amgTree: Effective permutation among trees

The same grouping occurs repeatedly in different trees. We only need to compute its P score at its first occurrence. We use Tree_grouping to store and retrieve the P score of all examined groupings. If the grouping formed by TreeQA can be found in Tree_grouping, its P score is directly used. Otherwise, its P score is calculated and stored in Tree_grouping.
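The lookup-then-store behaviour can be sketched with an ordinary dictionary. This swaps in a hash map keyed by a frozenset of frozensets for the prefix tree Tree_grouping used by TreeQA, purely for illustration; `make_cached_scorer` and the dummy scoring function are hypothetical.

```python
def make_cached_scorer(score_fn):
    """Wrap `score_fn` so that each distinct grouping is scored only once."""
    cache = {}

    def scorer(grouping):
        # Order-independent key: the same grouping found in another tree hits the cache.
        key = frozenset(frozenset(block) for block in grouping)
        if key not in cache:
            cache[key] = score_fn(grouping)
        return cache[key]

    return scorer, cache


calls = []


def dummy_score(grouping):
    """Stand-in for the permutation-based P score; records each real evaluation."""
    calls.append(grouping)
    return float(len(grouping))


scorer, cache = make_cached_scorer(dummy_score)
first = scorer([{1, 5}, {2, 3}])    # computed and stored
second = scorer([{3, 2}, {5, 1}])   # same grouping met again: retrieved from the cache
```

After both calls, `dummy_score` has run only once.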

4.4. Reuse of Intermediate Computation of Statistical Tests

For any sample subset S′, SQ(S′) and SM(S′) calculated using the original phenotype values (with no permutation) may be reused in any grouping containing S′ and all its parent-groupings. We denote them by SQ₀(S′) and SM₀(S′) respectively in the following discussion.

We employ a global prefix-tree Tree_subset to keep track of all sample subsets in any groupings examined thus far. Three values are stored at the leaf node corresponding to the subset S′: (subset ID, SQ₀(S′), SM₀(S′)).

For example, given the 10 samples and their phenotype values in Fig. 1(a), we calculate the base score F₀ of grouping G₁(S′) = {{s₁, s₅}, {s₂, s₃}, {s₇, s₁₀}}.

$$\begin{array}{l} SQ_{0}(S_{1_{1}}^{'}) = 19106,\ SQ_{0}(S_{1_{2}}^{'}) = 16805,\ SQ_{0}(S_{1_{3}}^{'}) = 9000. \\ SM_{0}(S_{1_{1}}^{'}) = 194,\ SM_{0}(S_{1_{2}}^{'}) = 183,\ SM_{0}(S_{1_{3}}^{'}) = 132. \\ F_{0}(G_{1}(S')) = 547.17 / 212.17 = 2.58. \end{array}$$

The SQ₀ and SM₀ values of the three subsets are then stored in Tree_subset. Given a parent-grouping of G₁(S′), G₂(S′) = {{s₁, s₅, s₂, s₃}, {s₇, s₁₀}}, we can retrieve the values of SQ₀ and SM₀ and use them to calculate F₀(G₂(S′)),

$$\begin{array}{l} SQ_{0}(S_{2_{1}}^{'}) = SQ_{0}(S_{1_{1}}^{'}) + SQ_{0}(S_{1_{2}}^{'}) = 35911,\ SQ_{0}(S_{2_{2}}^{'}) = SQ_{0}(S_{1_{3}}^{'}). \\ SM_{0}(S_{2_{1}}^{'}) = SM_{0}(S_{1_{1}}^{'}) + SM_{0}(S_{1_{2}}^{'}) = 377,\ SM_{0}(S_{2_{2}}^{'}) = SM_{0}(S_{1_{3}}^{'}). \\ F_{0}(G_{2}(S')) = 1064.08 / 166.69 = 6.38. \end{array}$$

The reuse of SQ₀(S′) and SM₀(S′) between parent/child groupings may work in conjunction with the inTree effective permutation. Besides, SQ₀(S′) and SM₀(S′) can also be reused by any groupings that contain the subset S′.
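The worked example can be reproduced from the cached sums alone. In this sketch a plain dictionary stands in for the prefix tree Tree_subset (an assumption made for illustration), the helper names are hypothetical, sample s_i is written as the integer i, and the (SQ₀, SM₀) values are those given in the example.

```python
def f_from_sums(blocks):
    """F score (Eqs. 1-4) from per-subset (SQ, SM, size) triples."""
    k = len(blocks)
    n = sum(size for _, _, size in blocks)
    ssb = [sm * sm / size for _, sm, size in blocks]
    sse = sum(sq for sq, _, _ in blocks) - sum(ssb)
    mm = sum(sm for _, sm, _ in blocks) / n
    return ((sum(ssb) - n * mm * mm) / (k - 1)) / (sse / (n - k))


# Cached (SQ0, SM0, size) per sample subset, taken from the example.
tree_subset = {
    frozenset({1, 5}): (19106, 194, 2),
    frozenset({2, 3}): (16805, 183, 2),
    frozenset({7, 10}): (9000, 132, 2),
}


def sums_for(subset, parts, cache):
    """Return the cached sums of `subset`, building them from cached parts if needed."""
    key = frozenset(subset)
    if key not in cache:
        triples = [cache[frozenset(p)] for p in parts]
        cache[key] = tuple(sum(t[j] for t in triples) for j in range(3))
    return cache[key]


# G1 = {{s1, s5}, {s2, s3}, {s7, s10}} uses the cached values directly.
g1 = [tree_subset[frozenset(b)] for b in ({1, 5}, {2, 3}, {7, 10})]
f0_g1 = f_from_sums(g1)

# G2 = {{s1, s5, s2, s3}, {s7, s10}} merges two cached subsets without touching raw data.
g2 = [sums_for({1, 5, 2, 3}, [{1, 5}, {2, 3}], tree_subset),
      tree_subset[frozenset({7, 10})]]
f0_g2 = f_from_sums(g2)
```

Rounded to two decimals, the sketch gives 2.58 for G₁(S′) and 6.38 for G₂(S′), and the merged subset is stored with SQ₀ = 35911 and SM₀ = 377 for later reuse.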

5. Results

We compare TreeQA with the following algorithms: 1) SMA, our implementation of the Single Marker Association algorithm¹⁷,⁵; 2) HAM, our implementation of the Haplotype Association Mapping algorithm¹⁵ that slides a 3-SNP window through the genome; 3) HapMiner¹⁰, downloaded from the website^e; and 4) TreeLD¹³, downloaded from the website^f. Both SMA and HAM use the one-way ANOVA test for fair comparison.

QHPM⁸ is not used for comparison because it is not scalable to large data sets. Blossoc¹⁴ and TreeDT¹⁸ are not used because they require categorical phenotypes.

5.1. Experiments on Simulated Data

We use Coasim¹¹ to simulate 1000 sequences with scaled recombination rate ρ = 400 that corresponds roughly to 10 cM. 10,000 SNP markers are placed uniformly at random over the sequences.

SNP markers on the sequences are randomly selected as causative loci with one, two and three causative mutations. The first SNP is always selected randomly from all SNPs. In the cases of two and three mutations, the second and third causative SNPs are selected from compatible SNPs that are located less than 10 SNPs away from the first SNP. Phenotype values are sampled from four Gaussian distributions: N₁(140, 35), N₂(90, 35), N₃(50, 40), and N₄(10, 35). The one-mutation case uses N₁ and N₃. The two-mutation case uses N₁, N₂ and N₃. The three-mutation case uses all four Gaussian distributions. After assigning the phenotype values, all causative SNPs are removed from the data and we randomly select 100 sequences for our experiments.

SMA, HAM and HapMiner output the top-scoring locus as a point estimate of the causative locus, while TreeQA outputs the top-scoring compatible region. We compare the effectiveness of the algorithms by measuring the distance (in cM) from the top-scoring locus, or from the center of the top-scoring region, to the causative SNP (or the average distance to every causative SNP). We call this distance the prediction error.

Since HapMiner cannot finish processing 10,000 SNP markers in a reasonable time, we only use the first 1,000 markers of each sequence when applying HapMiner to the simulated data.

The comparison of SMA, HAM, HapMiner and TreeQA is shown in Figure 2. The x-axis represents the prediction error (distance) to the causative locus and the y-axis represents the percentage of causative loci that are found within a distance of less than x. In all three cases, the loci estimated by TreeQA are closer to the causative loci than those estimated by SMA, HAM and HapMiner.

Fig. 2. Comparison of SMA, HAM, HapMiner and TreeQA on the simulated data

The TreeLD algorithm uses local phylogenies and analyzes quantitative phenotypes. However, TreeLD can only process a very small amount of data in reasonable time. Therefore, we select 36 samples and 20 SNP markers from the simulated data for performance comparison. A one-mutation causative locus is selected from the 20 SNPs. For TreeQA, instead of generating maximal compatible regions as discussed in Sec. 4, a compatible region is generated around each SNP and contains up to five SNPs. TreeLD takes about two hours to analyze this small dataset, while TreeQA finishes in seconds. Figure 3 plots the results from TreeLD and TreeQA. The x-axis represents the simulated positions in the genome and the y-axis represents the scores of the SNPs. The vertical line marks the causative locus. TreeQA detects a peak near the causative locus while TreeLD identifies two spurious peaks.

Fig. 3. Comparison of TreeLD and TreeQA on the simulated data

5.2. Experiments on Mouse Genotype Data

We used a set of mouse genotypes that combines experimental and imputed data^g,²⁰ from the Jackson Laboratory, consisting of 74 samples. The dataset contains over 7 million SNP markers distributed over all 20 chromosomes. We removed wild-derived mouse inbred strains, since they are quantitatively and qualitatively different from other laboratory inbred strains, and used only the remaining 55 samples, which have a shared set of common ancestral relationships¹⁹.

We used high density lipoprotein cholesterol (HDL-C) levels in blood as the test phenotype, downloaded from the Mouse Phenome Database^h. Several HDL-C datasets are available, each of which was collected under different conditions, and are thus treated as separate phenotypes. Some candidate genes that may play a role in regulating HDL-C levels are reported in⁹.

We apply SMA, HAM and TreeQA to the data and examine how close the top peak reported by each method is to the loci of those candidate genes.

TreeQA detects top peaks near the locations of over 10 of the candidate genes⁹, including Ppara, Abcb4 and Rxrb. The top peaks reported by SMA and HAM are often far from the locations of these genes. Due to space limitations, we only show the results for one of them, Abcb4, in Figure 4.

Fig. 4. Comparison of SMA, HAM and TreeQA on the mouse genotype data

The perfect phylogeny corresponding to the peak point (compatible region from 8799298 to 8801558 (base)) found by TreeQA is plotted in Fig. 5. The phenotype values of the samples are in parentheses. Samples with unknown phenotype values are omitted from the tree. The subtree on the right contains samples having high phenotype values while the subtree at the bottom contains samples having low values. Other subtrees are considered as outliers and are excluded from the grouping. SMA and HAM fail to identify the locus because they only examine sample groupings that can be generated from single SNPs or 3-SNP windows, which are a small subset of the groupings examined by TreeQA.

Fig. 5. The perfect phylogeny at the peak point found by TreeQA in Figure 4

TreeQA takes about 10 minutes to analyze each chromosome which contains around 40000 SNPs on average. SMA and HAM take slightly less time than TreeQA. Both HapMiner and TreeLD are unable to finish in reasonable time.

6. Conclusion

In this paper, we present a tree-based quantitative GWA mapping algorithm, TreeQA. TreeQA utilizes local perfect phylogenies in detecting associations. Perfect phylogenies provide sensible groupings of samples at multiple resolutions. TreeQA explores the space of all possible groupings implied by the perfect phylogenies in a carefully designed order so that intermediate computations can be maximally reused. Our experimental results on both simulated and real data show that TreeQA can efficiently conduct quantitative GWA analysis and is more effective than the previous methods.

Footnotes

a. For example, the number of trees can exceed tens of thousands in a chromosome-wide association study, and there are up to 2²ⁿ⁻² groupings that can be generated from a tree of n samples.

b. With ɛ error rate, the risk of reporting at least one spurious association from x tests is 1 − (1 − ɛ)^x.

c. Considering groupings of a sample subset allows TreeQA to exclude potential outliers from the ANOVA test.

d. Any other way of defining a total order of the subsets would also work.

e. http://vorlon.case.edu/jx1175/HapMiner.html

f. http://pritch.bsd.uchicago.edu/treeld.html

g. http://cgd.jax.org/ImputedSNPData/imputedSNPs.htm

h. http://phenome.jax.org/pub-cgi/phenome/mpdcgi?rtn=meas/catlister/req=Cblood+lipids

Contributor Information

Feng Pan, Email: panfeng@cs.unc.edu.

Leonard McMillan, Email: mcmillan@cs.unc.edu.

Fernando Pardo-Manuel de Villena, Email: fernando@med.unc.edu.

David Threadgill, Email: dwt@med.unc.edu.

Wei Wang, Email: weiwang@cs.unc.edu.

References

1. Miller RG. Simultaneous Statistical Inference. New York: Springer Verlag; 1981.
2. Hudson RR, Kaplan NL. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics. 1985;111(1):147–164. doi: 10.1093/genetics/111.1.147.
3. Agarwala R, Fernandez-Baca D, Slutzki G. Fast algorithms for inferring evolutionary trees. Journal of Computational Biology. 1995;2(3):397–408. doi: 10.1089/cmb.1995.2.397.
4. Toivonen H, Onkamo P, Vasko K, et al. Data mining applied to linkage disequilibrium mapping. Am J Hum Genet. 2000;67:133–145. doi: 10.1086/302954.
5. Akey J, Jin L, Xiong M. Haplotypes vs single marker linkage disequilibrium tests: what do we gain? Eur J Hum Genet. 2001;9(4):291–300. doi: 10.1038/sj.ejhg.5200619.
6. Larribe F, Lessard S, Schork NJ. Gene mapping via the ancestral recombination graph. Theoretical Population Biology. 2002;62(2):215–229. doi: 10.1006/tpbi.2002.1601.
7. Morris AP, Whittaker JC, Balding DJ. Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. Am J Hum Genet. 2002;70(3). doi: 10.1086/339271.
8. Onkamo P, Ollikainen V, Sevon P, et al. Association analysis for quantitative traits by data mining: QHPM. Ann Hum Genet. 2002;66:419–429. doi: 10.1017/S0003480002001318.
9. Wang X, Paigen B. Quantitative trait loci and candidate genes regulating HDL cholesterol. Arteriosclerosis, Thrombosis, and Vascular Biology. 2002;22:1390. doi: 10.1161/01.atv.0000030201.29121.a3.
10. Li J, Jiang T. Haplotype-based linkage disequilibrium mapping via direct data mining. Bioinformatics. 2005;21(24):4384–4393. doi: 10.1093/bioinformatics/bti732.
11. Mailund T, Schierup MH, et al. Coasim: A flexible environment for simulating genetic data under coalescent model. BMC Bioinformatics. 2005;6(252). doi: 10.1186/1471-2105-6-252.
12. Waldron E, Whittaker J, Balding D. Fine mapping of disease genes via haplotype clustering. Genetic Epidemiology. 2005;30:170–179. doi: 10.1002/gepi.20134.
13. Zöllner S, Pritchard JK. Coalescent-based association mapping and fine mapping of complex trait loci. Genetics. 2005;169(2):1071–1092. doi: 10.1534/genetics.104.031799.
14. Mailund T, Besenbacher S, Schierup MH. Whole genome association mapping by incompatibilities and local perfect phylogenies. BMC Bioinformatics. 2006;7:454. doi: 10.1186/1471-2105-7-454.
15. McClurg P, Pletcher MT, Wiltshire T, Su AI. Comparative analysis of haplotype association mapping algorithms. BMC Bioinformatics. 2006;7:61. doi: 10.1186/1471-2105-7-61.
16. Minichiello MJ, Durbin R. Mapping trait loci by use of inferred ancestral recombination graphs. Am J Hum Genet. 2006;79(5):910–922. doi: 10.1086/508901.
17. Pe’er I, et al. Evaluating and improving power in whole-genome association studies using fixed marker sets. Nature Genetics. 2006;38:663–667. doi: 10.1038/ng1816.
18. Sevon P, Toivonen H, Ollikainen V. TreeDT: Tree pattern mining for gene mapping. IEEE Transactions on Computational Biology and Bioinformatics. 2006;3(2). doi: 10.1109/TCBB.2006.28.
19. Yang H, Bell TA, Churchill GA, de Villena FP-M. On the subspecific origin of the laboratory mouse. Nature Genetics. 2007;39. doi: 10.1038/ng2087.
20. Szatkiewicz JP, Beane GL, Ding Y, Hutchins L, de Villena FP-M, Churchill GA. An imputed genotype resource for the laboratory mouse. Mammalian Genome. 2008;19(3). doi: 10.1007/s00335-008-9098-9.
21. Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to algorithms.

