Abstract
When a novel genetic trait arises in a population, it introduces a signal in the haplotype distribution of that population. Through recombination, that signal’s history becomes differentiated from the DNA distant to it, but remains similar to the DNA close by. Fine-scale mapping techniques rely on this differentiation to pinpoint trait loci. In this study, we analyzed the differentiation itself to better understand how much information is available to these techniques. Simulated alleles on known recombinant coalescent trees show the upper limit for fine-scale mapping. Varying characteristics of the population being studied increase or decrease this limit. The initial uncertainty in map position has the most direct influence on the final precision of the estimate, with wider initial areas resulting in wider final estimates, though the increase is sigmoidal rather than linear. The θ of the trait (4Nμ) is also important, with lower values for θ resulting in greater precision of trait placement up to a point—the increase is sigmoidal as θ decreases. Collecting data from more individuals can increase precision, though only logarithmically with the total number of individuals, so that each added individual contributes less to the final precision. However, a case/control analysis has the potential to greatly increase the effective number of individuals, as the bulk of the information lies in the differential between affected and unaffected genotypes. If haplotypes are unknown due to incomplete penetrance, much information is lost, with more information lost the less indicative phenotype is of the underlying genotype.
Keywords: fine mapping, coalescent analysis, ancestral recombination graphs, linkage disequilibrium mapping, maximum likelihood
INTRODUCTION
Through recombination, the history of the sites surrounding a novel genetic trait becomes distinct from the history of the sites distant from that trait. The variation in the genetic sequences of individuals with and without the trait can allow researchers to distinguish sites with similar histories from sites with dissimilar histories, allowing the trait to be mapped and, hopefully, revealing the genetic basis for the observed trait. This similarity in genetic history between causative alleles and nearby alleles is reflected in linkage disequilibrium, and has been used to map a wide variety of human genetic diseases [Weiss and Terwilliger 2000]. Modern studies often use genome-wide linkage disequilibrium to map the diverse causes of complex diseases [Badano and Katsanis 2002], but even here, each contributory allele is discovered through an analysis of nearby SNPs, an analysis that works because of the shared genetic history of the SNP and the causative allele. Ideally, one would use SNPs ascertained from the sampled individuals, but imputed SNPs from panels can also be used. The only difference is that the ascertainment scheme for imputed SNPs is often more complex, and determining an appropriate correction is difficult [Kuhner et al. 2000].
The strongest signal for the trait in both simple and complex diseases is, of course, the causative allele itself. If it is known that this allele has been sequenced, the genetic history of the sampled population becomes irrelevant, and association studies can be applied directly to the sequence data [Felsenstein, in prep]. But if that allele has not been sequenced, or if it is unknown whether it has been sequenced, the surrounding polymorphisms can be used to extrapolate the genetic history of the sampled individuals, and we can look for the signal in the patterns of co-inheritance. The pattern of co-inheritance is codified explicitly and completely in the recombinant coalescent tree, or ancestral recombination graph. In this study, we elected to study that tree itself, rather than reconstructions of it, to determine the maximum amount of information any measure of co-inheritance could provide. Figure 1 illustrates one such very simple tree, and shows how mapping can be performed directly on the tree instead of using SNPs. All mapping methodologies, from basic linkage disequilibrium to the more complex methods discussed below, are ways to get at the information present in this true tree.
Many mapping analyses begin with linkage analysis of families to identify chromosomal regions that are associated with the trait. These studies typically narrow the area where the trait locus must be to a region several centimorgans in size, which may contain hundreds of genes. To further pinpoint the location of the trait locus, fine-scale mapping techniques must be used. The simplest, oldest, and most common fine-scale mapping technique is to use linkage disequilibrium measures. The efficacy of various traditional disequilibrium measures was reviewed by Devlin and Risch [1995]. These measures are point estimates of how closely a particular genetic polymorphism matches the polymorphism of the phenotype. Haplotype Pattern Mining [Toivonen et al. 2000] looks for recurrent marker patterns in the data, and other programs are available that similarly look for other patterns of co-inheritance. CLADHC [Durrant et al. 2004] collects patterns of variation into cladograms to look for association in that way.
Other techniques and programs are available that take a more genealogical approach to fine-scale mapping. McPeek and Strahs [1999] refined the basic disequilibrium measures for use in multilocus studies by modeling the decay of haplotype sharing (DHS), implemented in the program DHSMAP. Other algorithms have followed, each with a different model of the underlying process and a different technique for extracting that process from the given data. COLDMAP [Morris et al. 2000; 2002] and GeneRecon [Mailund et al. 2006] use the 'Shattered Coalescent' to estimate allele locations. BLADE [Liu et al. 2001] uses a Bayesian analysis that allows for multiple ancestral haplotypes. DMLE+ [Rannala and Reeve 2001; Reeve and Rannala 2002] models a variety of data types (including RFLPs) and uses the annotated human genome sequence to construct a prior for allele location. LATAG [Zöllner and Pritchard 2005] performs interval-based coalescent reconstructions that are recombination-aware but do not have to model the entire recombinant coalescent tree. Several of the approaches above are reviewed by Molitor et al. [2004].
On a larger scale, the program ‘Margarita’ [Minichiello and Durbin 2006] collects plausible recombinant coalescent trees for large amounts of data for entire chromosomes. This resembles the fine-scale mapping programs above in its attempt to reconstruct plausible genetic histories for the trait alleles, but applies this to the question of large-scale instead of fine-scale mapping.
All these methods rely on the genetic history of the trait being studied, though some address this reliance directly and some only indirectly. Linkage disequilibrium is an observable result of the same genetic process that is the ultimate cause of the present-day distribution of trait alleles. As such, it can be a very useful tool to use to track down the location of those alleles. (For a review that delves into more detail on this, see Nordborg and Tavaré [2002].)
When the causative allele itself has not been sequenced, all remaining evidence is indirect: nearby sequenced polymorphisms provide information about the ancestral pattern of genetic inheritance, and that pattern provides information about the most likely site for the trait locus. The accuracy of all fine-scale mapping analyses relies on two factors: whether the technique accurately recovers the pattern of inheritance, and whether that pattern distinguishes between the potential locations. Some techniques perform both steps at once, and do not include explicit reconstructions, but even these, by necessity, are limited by these two factors. The pattern of inheritance is codified explicitly and completely in the recombinant coalescent tree. Our use of known trees with known trait alleles simulates a ‘perfect’ technique, able to completely capture the inheritance pattern. The power of this tree to map the trait gives an upper limit to the precision any technique can achieve, and thus provides a ‘gold standard’ by which any method from the simple linkage disequilibrium estimates to the more complicated tree-based methods may be judged.
In addition, we examine how characteristics of the population being studied can influence this upper limit. Our results will help researchers determine beforehand if attempting to map a particular trait in a particular population is hopeless or promising, and what data collection strategy will be most effective.
MATERIALS & METHODS
Simulations
Recombinant coalescent trees were simulated via algorithms first developed for the program Recombine [Kuhner et al. 2000], under a variety of simulation parameters. The number of sites (l), the number of haplotypes sampled, the values of θ, and the recombination rate (r) were all varied systematically. Population parameters were defined as follows: θ = 4Nμ, with N the effective population size and μ the mutation rate in mutations per generation for the trait alleles; r = C/μ, with C the number of recombination events per pair of adjacent sites per generation. We found the summary statistic θrl = 4NCl, a map-length parameter scaled by the population size, helpful when comparing different analyses. We therefore present results using this statistic divided by 400, which corresponds to distances in centimorgans for humans. Results for other organisms can be obtained by multiplying these results by the ratio of the other organism's effective population size to the effective population size of humans, assumed here to be 10,000.
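As a concrete illustration, a minimal sketch of this scaling (the function names are our own, not from any published package):

```python
# Minimal sketch of the map-length scaling described above; the
# function names are ours, not from any published package.

def scaled_statistic(theta, r, l):
    """The summary statistic theta*r*l = 4NCl (theta = 4N*mu, r = C/mu)."""
    return theta * r * l

def to_human_cM(statistic_4NCl):
    """Map length in centimorgans for an effective population size of
    10,000: 4NCl/400 equals 100*C*l (the cM distance) when N = 10,000."""
    return statistic_4NCl / 400.0

# Example: theta = 1.0, r = 0.01, l = 1,000,000 sites
# gives 4NCl = 10,000, i.e. 25 cM on the human scale.
print(to_human_cM(scaled_statistic(1.0, 0.01, 1_000_000)))  # 25.0
```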
Simulation of the alleles at the trait locus was performed by one of three methods. In the first method, a site and an ancestral state were chosen for the trait at random, and the allele was then allowed to mutate using a symmetrical two-state model. If this resulted in invariant or nearly-invariant data (defined as fewer than three samples carrying the minority allele), the data set was discarded and simulated again at the same site on the same tree.
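A minimal sketch of this first method, assuming a nested-tuple tree representation of our own devising (label, branch length in expected mutations per site, list of children):

```python
import math
import random

# Sketch of the first simulation method: a symmetric two-state trait
# evolved down a known tree, with near-invariant data sets rejected.
# The (label, branch_length, children) format is our own; the root's
# branch length should be 0.

def evolve(state, branch_length):
    """Symmetric two-state model: flip with probability (1 - e^(-2t))/2."""
    p_flip = 0.5 * (1.0 - math.exp(-2.0 * branch_length))
    return 1 - state if random.random() < p_flip else state

def simulate_tips(node, state):
    """Return {tip label: allele} by recursing from the root downward."""
    label, length, children = node
    state = evolve(state, length)
    if not children:
        return {label: state}
    tips = {}
    for child in children:
        tips.update(simulate_tips(child, state))
    return tips

def simulate_trait(tree, min_minority=3):
    """Redraw on the same tree until the minority allele appears in at
    least min_minority samples, as in the rejection step above."""
    while True:
        tips = simulate_tips(tree, random.choice([0, 1]))  # random ancestral state
        ones = sum(tips.values())
        if min(ones, len(tips) - ones) >= min_minority:
            return tips
```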
In the second method, used in the case/control analyses, trees of 40-100 tips were simulated under various parameter values and trait data were simulated on them as above. If the tree happened to have exactly 20 tips with the minority allele, it was saved; otherwise, the entire tree was discarded. The 20 minority allele tips (the cases) and a randomly-selected 20 majority allele tips (the controls) were saved, the data from the remaining tips were discarded, and the resulting data sets were analyzed. A given set of trees modeled the case where the minority allele was found at a given frequency in the general population, but for which an equal number of cases and controls were collected.
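The ascertainment step of this second method might be sketched as follows, reusing the hypothetical simulate_trait() helper above:

```python
import random

# Sketch of the case/control ascertainment (second method): a data set
# is kept only if exactly 20 tips carry the minority allele; those
# become the cases, and 20 randomly chosen majority-allele tips become
# the controls.

def ascertain_case_control(tip_alleles, n_cases=20, n_controls=20):
    """tip_alleles: {tip label: 0/1}. Returns (cases, controls), or
    None if the whole tree must be discarded."""
    ones = [t for t, a in tip_alleles.items() if a == 1]
    zeros = [t for t, a in tip_alleles.items() if a == 0]
    minority, majority = (ones, zeros) if len(ones) <= len(zeros) else (zeros, ones)
    if len(minority) != n_cases:
        return None  # discard the entire tree, not just the data set
    return minority, random.sample(majority, n_controls)
```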
In the third method, used in the penetrance analyses, DNA was first simulated at all sites under the F84 model [Kishino and Hasegawa 1989] for DNA mutation with equal base frequencies and a transition/transversion ratio of 2.0. A site was chosen at random from the variable sites where the minority allele(s) had a total frequency of at least three samples. The majority allele was then marked as a single state, and all other alleles were assigned to the other state. The remaining simulated data were not used; since the true tree was known, there was no need for tree inference.
In all these methods, we assume that while a trait may be caused by multiple events, the locations of these events were not separated by recombination in the history of the sampled population. For analyses with penetrance models, pairs of samples were randomly combined into individuals, who were then assigned a phenotype based on their simulated genotype. When individuals with a single genotype could potentially exhibit multiple phenotypes, a phenotype was chosen at random based on the penetrance model.
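A sketch of this pairing-and-phenotyping step; the assignment of the example penetrance values to genotypes here is our own illustrative choice:

```python
import random

# Sketch of pairing haploid samples into diploid individuals and
# drawing a phenotype from a penetrance model. The example model gives
# P(affected) = 0.8, 0.6, 0.2 for carriers of two, one, or zero copies
# of the trait allele (an 80:60:20-style model).

def assign_phenotypes(alleles, penetrance={2: 0.8, 1: 0.6, 0: 0.2}):
    """alleles: list of 0/1 trait alleles, even length (shuffled in
    place for random pairing). Returns [(genotype, affected), ...]."""
    random.shuffle(alleles)
    individuals = []
    for i in range(0, len(alleles), 2):
        genotype = (alleles[i], alleles[i + 1])
        affected = random.random() < penetrance[sum(genotype)]
        individuals.append((genotype, affected))
    return individuals
```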
Likelihood calculation
To map the simulated trait alleles, the likelihood that the data (either known alleles or known phenotypes) would be produced on the known tree at a given site (the 'data likelihood') was calculated for each candidate trait-locus position, using a symmetrical two-state data model with the peeling algorithm of Felsenstein [1981]. This model was used instead of a DNA mutation model to more closely imitate traits for which a variety of mutational events might cause the trait. We used this model for all experiments, including those where the trait data were created by converting simulated DNA to two states. Assuming a uniform prior probability over all candidate positions allowed us to convert these likelihoods to posterior probabilities. The sites with the highest probability of containing the trait were then collected until the total inferred probability that the true site had been collected was 95%. (These sites were not required to be contiguous.) The number of sites in this collection measures the precision of the mapping attempt.
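The collection step can be sketched as follows (our own code, taking per-site data likelihoods on the linear scale as input):

```python
import numpy as np

# Sketch of the mapping step described above: per-site data
# likelihoods are normalized under a uniform prior into posterior
# probabilities, and the highest-probability sites are collected
# (not necessarily contiguously) until they account for 95%.

def credible_site_set(site_likelihoods, coverage=0.95):
    """Return indices of the smallest set of sites whose summed
    posterior probability reaches `coverage`; the size of this set,
    averaged over replicates, is the paper's precision measure."""
    like = np.asarray(site_likelihoods, dtype=float)
    posterior = like / like.sum()            # uniform prior over sites
    order = np.argsort(posterior)[::-1]      # best sites first
    cumulative = np.cumsum(posterior[order])
    n_needed = int(np.searchsorted(cumulative, coverage)) + 1
    return sorted(order[:n_needed].tolist())
```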
To calculate the likelihood for ambiguous data (for individuals whose phenotypes allowed more than one genotype, or unphased haplotypes), the probabilities of all possible haplotypes were summed (Appendix A). For data sets with more than one individual displaying an ambiguous phenotype, each combination of resolved haplotypes had to be summed. This method goes beyond most previous fine-mapping analyses, where either the data are computationally phased by a program such as PHASE [Stephens and Donnelly 2003] before the mapping analysis, or different resolutions of the ambiguous data are sampled during the analysis [Kuhner et al. 2000]. Our method is more precise, but computationally intensive, particularly in cases with many unphased individuals. The number of calculations that must be performed is proportional to S^N, where S is the number of ambiguous states and N is the number of individuals with ambiguous phenotypes. We found a novel method to reduce this computational burden by 'collapsing' haplotypes so that the maximum number of ambiguous states for a diploid individual is two. This method is described in detail in Appendix A.
Amount of recombination
The number of expected recombination events in a tree depends on a complex interaction between the recombination rate, the number of sites under consideration (minus one), and the number of tips in the tree. The equation for the simplest case (two tips and two sites) is presented in Appendix B. The solution for more complicated cases is sufficiently arcane (the two-tip, three-site case involves equilibria among 16 states instead of just three) that we approximated it using Monte Carlo simulation of trees with different values of r, l, and number of samples. Trees were simulated using an implementation of Hudson's recombinant coalescent tree simulator [Hudson 1983]. This simulator is 'final coalescent' aware, meaning that recombination events affecting only lineages whose common ancestor has already been reached are ignored (as they cannot affect the present-day data).
Software
For experimental conditions where the summary statistic 4NCl was 20 or less (0.05 cM), a modified version of the LAMARC program [Kuhner 2006] was used to create trees, simulate data on those trees, and calculate the likelihood of the simulated data. For experiments involving 4NCl greater than 20, a series of programs was used in concert for efficiency: an algorithm based on the Hudson simulator [Hudson 1983] to create trees, a simple external program to generate trait data on those trees, the PHYLIP program 'dnamlk' [Felsenstein 2005] to calculate data likelihoods, and a Perl script to perform the final mapping analysis. The two implementations followed the same underlying algorithms and produced identical results from the same starting conditions.
Analysis
1000 replicate experiments were performed for each analyzed parameter combination, with trees constructed, data simulated, and likelihoods assessed. When multiple trait models with different penetrances were compared under the same conditions (population size, recombination rate, etc.), the same trees and simulated data were used for each, differing only in the assignment of phenotypes to the simulated genotypes.
Each replicate experiment resulted in a set of the most probable locations of the trait in question which collectively had a 95% probability of including the truth (the ‘final map length’). The more informative the data, the smaller the final map length. The average number of sites included over the 1000 experiments is therefore an estimate of the amount of information present. These results are given in centimorgans (cM), scaled to a population with an effective size of 10,000 (such as humans).
RESULTS
Within each 1000-replicate study, results varied widely. Even under the least-informative conditions, the final map length was sometimes small, and even under the most-informative conditions, it was sometimes large. One practical message is that the success of a mapping attempt is not guaranteed even under optimal conditions, nor is failure guaranteed by non-optimal ones.
Figure 2 shows a graph of a representative experiment testing the increase in information from adding more samples. Each point on a line shows the number of replicate experiments whose final map length was the given distance or shorter. Each line starts close to zero (representing the most informative simulation of the 1000) and extends to 95% of the original map length (0.025 cM), representing simulations with no information at all (one can be 95% certain of including the correct site by simply excluding a random 5% of the candidate sites). The differences between experimental conditions can be seen in how fast each line changes from very informative to minimally informative. In some of our simulations the shape of this distribution deviated from the typical 'vibrating string' shape seen in Figure 2; where it did not deviate, we report the average map length.
Different experimental conditions can therefore be compared to see which contain more information about the location of the trait. As a result, knowing the population parameters that influenced the history of a trait can give us a fair idea of how successful we might be in mapping it. The parameters studied here are map distance, θ, the length of the stretch of DNA where the locus might reside, the number of individuals sampled, and the effect of systematic oversampling of cases versus controls.
Map distance
Without recombination, disequilibrium mapping would be impossible. The total amount of recombination over the region to be mapped strongly influences how much power is available to map any trait. A mapping study with a large map distance to search has more information available to pinpoint the location. However, this information is spread out over a longer distance, which results in longer final map distances. Figure 3 shows the correlation between initial map distance and the final map distance in centimorgans. For low numbers of samples, the correlation is roughly linear (on a log-log plot), but somewhat sigmoidal for more samples.
θ and l were kept constant for these experiments at 1.0 and 1,000,000, respectively. Decreasing l to 1,000 and increasing r by a factor of 1,000 (leaving the total map distance constant) produced nearly identical results for all conditions tested (data not shown).
Population size and mutation rate
θ is a measure of the genetic diversity present in a population, increasing with larger populations and with higher mutation rates. The θ for the trait itself, which we use here, may be different from the θ for the markers in the same genomic region if the mutation rates differ. For example, a disease whose alleles are active and inactive forms of a gene may have a much higher mutation rate than a single base pair, since there are many different ways to inactivate a gene. Similarly, a trait solely caused by a deletion event may have a lower mutation rate than the single-base substitution rate.
In this study, we considered only the informativeness of the underlying trait genealogy, and therefore considered only the trait θ. Marker θ will of course affect the success of attempts to infer the genealogy. Our results assume perfect inference and therefore represent an upper bound on mapping precision.
In trees with the same amount of expected recombination (the same number of samples and same recombination distance), a lower θ for the trait meant a narrower confidence interval, to a point. This effect followed a sigmoidal pattern, seen in Figure 4. Shown are plots of final map distances vs. θ for trees of 10 samples (squares) and 18 samples (triangles), calculated from an initial map distance of 0.025 cM. All points are averages of 1000 replicates.
Number of samples
Collecting data from more individuals is one obvious tactic for gaining precision, given that most other factors that influence the amount of information in the sample are outside the researcher's control. Figure 5 shows the same simulation results as Figure 3, this time with the final map distance plotted against the number of samples, for several different values of the original map distance. More samples increase precision, with the effect more pronounced when the original map length is greater.
However, we have already seen that increasing recombination events increases the precision of the estimate, and we know that adding samples will increase the number of recombination events. How much of the added precision with increased samples is due to the increased number of recombination events, and how much is due to the new information contained in the new samples?
To separate these two conditions, we performed simulations with the initial map length chosen such that the expected average number of recombination events remained constant (between 50.3 and 50.5) for each tested number of samples, as determined by Monte Carlo simulation using the Hudson simulator. Figure 6 shows the results of these simulations, and compares these results to the previous experiment where the initial map length remained constant as the number of samples changed. (An initial map distance of 0.0125 cM with 10 samples results in approximately 50.4 average expected recombinations.) As can be seen, the additional recombinations present in a tree with more samples accounts for about 15% of the increase in precision, leaving 85% due to the added information present in the increased number of samples.
Oversampling
However, many researchers do not collect randomly sampled data from the population at large, but instead collect a given number of cases and controls irrespective of the allele frequency in the general population. To study the effect of this methodology, we performed a series of mapping simulations where 20 cases and 20 controls were collected from a population, over a range of minor-allele frequencies in the general population and a range of initial map distances. These results are shown in Figure 7, and show mapping precision increasing with minor-allele frequency. The variability of these results was smaller than for the analyses where all the simulated data were examined (variance data not shown). Computational limits prevented us from simulating minor allele frequencies lower than 20%.
In addition, we also analyzed our case/control trees with the original data, i.e. with the same 20 cases, but using all the controls instead of just the 20 we randomly sampled (in the 40-tip case, these are identical to the case/control analysis). These represent a study where samples were chosen until 20 cases were found, and then analyzed. A comparison of the two methods is shown in Figure 8, which shows the difference in accuracy between the two methods in centimorgans.
As can be seen, the difference in accuracy between the two measures is relatively small (~0.006 cM even in the worst case), though inversely correlated with the frequency of the minor allele.
Penetrance
All the experiments thus far have assumed that all trait alleles are fully haplotyped. In reality, this is seldom the case. Even if all homozygotes and heterozygotes have unique phenotypes (the codominant case) or are otherwise distinguishable from one another (as through pedigree studies), the phase of the heterozygote can seldom be determined. There is also the issue of incomplete penetrance. These phenomena clearly cause loss of information; the question is: how much?
We studied a variety of penetrance cases, and compared how increasing sample size (N) affected the results. Unfortunately, with computational complexity increasing as 2^N, we were only able to obtain results for 1,000 replicates up to N = 32. Figure 9 shows the results for several partially-penetrant cases as compared to the fully-haplotyped case. Eight cases contain results for simulations with varying degrees of multiplicative penetrance. Eight more contain results where only the heterozygote was partially penetrant, in order to get a handle on where information is encoded in the data.
All tested penetrance cases with a multiplicative penetrance model lost the vast majority of their mapping precision (92-99% in the 80:60:20 case). The less extreme models where only the heterozygote was partially penetrant did better, but still lost precision (25-62% in the 100:20:0 case).
DISCUSSION
At the most basic level, any site that has always been co-inherited with a candidate site throughout the history of our sampled sequences cannot be distinguished from it as a potential candidate for the location of the trait allele. (We will refer to the coalescent tree for a set of co-inherited sites uninterrupted by recombination anywhere in the complete ancestral recombination graph as an ‘interval tree’.) This means that the minimum mappable length is the length of the interval tree containing the trait allele. As uncertainty increases, more intervals must be included, decreasing the precision of our estimates and increasing the final map length. (The researcher’s ability to accurately reconstruct the interval trees goes down with the number of variable sites in each interval, as there is less data to work with. This is an important consideration for real-world analyses, but is beyond the scope of this study.)
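To make the interval-tree bookkeeping concrete, a small sketch (our own representation, with recombination breakpoints given as site indices):

```python
from bisect import bisect_right

# Illustrative sketch: an ancestral recombination graph viewed as a
# sequence of interval trees, each covering the run of sites between
# adjacent recombination breakpoints. The minimum mappable length is
# the width of the interval containing the trait site.

def interval_bounds(breakpoints, n_sites, trait_site):
    """breakpoints: sorted site indices where recombination occurred.
    Returns (start, end) of the interval tree containing trait_site."""
    i = bisect_right(breakpoints, trait_site)
    start = breakpoints[i - 1] if i > 0 else 0
    end = breakpoints[i] if i < len(breakpoints) else n_sites
    return start, end

# Example: breakpoints at sites 120 and 480 in a 1,000-site region;
# a trait at site 300 cannot be mapped more finely than sites 120-480.
print(interval_bounds([120, 480], 1000, 300))  # (120, 480)
```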
We can compare how probable it is that the observed pattern of data (i.e. which individuals display which traits) would be produced by each interval tree, and thereby distinguish the best interval trees from the worst. The extent to which this ‘data likelihood’ distinguishes the interval trees from one another ultimately determines how precise the estimate of the trait location will be. This means that the true interval tree must be sufficiently distinct from the other interval trees, and that the mutation(s) that caused the trait differentiation must have arisen within a part of that interval tree with a unique set of descendants when compared to other interval trees.
Interval tree width, interval tree distinctiveness, and the specifics of mutational events all contribute to the amount of signal available in the tree. The total map distance, the trait-locus θ, the amount and type of data collected, and the penetrance of the causative allele all affect these constraints in different ways.
Map distance
Higher recombination rates and longer map distances produce more recombination events, which means a greater number of interval trees, and more distinctiveness between interval trees. This increase in information can potentially compensate for the added uncertainty involved in mapping a trait with a long initial map length. Figure 3 shows that the length of the final map distance increases as the initial map distance increases, but that for a sufficient number of samples, the increase is moderate. Results are shown scaled for a human-like population of N= 10,000, but can be scaled for populations of any size. All recombinant coalescent trees with the same number of samples and the same average number of recombination events will contain, on average, the same amount of information. (The number of markers available for tree reconstruction may differ; again, this is beyond the scope of this paper.)
The presence of recombination hot spots can affect the tree's information content. The clustering of recombination events at hot spots can cause 'loss' of interval trees, as multiple events occur at the same locations. However, a similar loss was not seen to affect the final map distance: over 5 cM, identical results were found for a region of 1,000 sites with a high recombination rate and a region of 1,000,000 sites with a low recombination rate (data not shown). The lower number of sites results in fewer interval trees, due to the increased probability of multiple recombination events at the same site; this is the same phenomenon seen at recombination hot spots. However, a significantly lower number of interval trees could increase the variance of the mapping analysis, making the mapping precision of a nascent study less predictable. In addition, if only a small number of hot spots are expected, the areas between them have a very high ratio of sites to recombination events; while the final map distance will be the same in centimorgans, it may represent a much larger number of sites than in a region where recombination events are more evenly distributed.
Population size and mutation rate
A large population size means a larger θ for both the marker data and for the trait data, which will result in more recombination events and observed mutations. The increased number of mutations in the marker data means that there will be more information (variable markers) from which to reconstruct the true tree. However, this is counterbalanced by the fact that a large θ for the trait allele makes multiple trait mutation events more likely. This tends to cause the data likelihoods of all interval trees to regress to the mean, making them harder to distinguish from each other. This effect can be examined on its own by comparing simulations in which θ for the trait is decreased while simultaneously increasing the recombination rate by the same factor, resulting in trees with the same number of expected recombination events. As seen in Figure 4, the analysis with the lower trait θ is more precise.
When estimating the θ for a trait for which the phenotype is the result of a gene dysfunction, one must consider not the mutation rate of DNA in general, but rather the cumulative rate for all mutations that could potentially cause dysfunction. As calculated by Pritchard [2001], the effective θ for disease-causing mutations in a typical gene in humans might be in the range 0.1 to 5.0. This is based on an estimated human θ of about 0.001 [Li and Sadler 1991; Cargill et al. 1999], a typical gene length of around 1,500 bp [Eyre-Walker and Keightley 1999], and an average of 100-1000 potential survivable non-synonymous mutations. This would make the value of θ for human diseases not ideal but in a plausible range for fine-scale mapping.
This analysis does ignore potential asymmetry between forward- and back-mutations. For example, the rate at which mutations cause a normal gene to be disrupted is much larger than the rate at which mutations cause a disrupted gene to be repaired. This effect will lower the probability of compensatory mutations in the same lineage, and thereby further distinguish interval trees from each other. Ignoring it is conservative, however, and in the case of low trait θ values, will not make much difference.
Two phenomena explain the two ends of the sigmoidal curves in Figure 4. At extremely high values of θ, the trait mutations saturate the tree, randomizing the observed alleles and making all data sets equally likely. As θ decreases, the data becomes less randomized and more structured, fitting some interval trees much better than others. Finally, at very low values of θ, data sets that would be produced with exactly one mutation are reasonably likely, while data sets that would be produced with more mutations are not. At that point, further decreases of θ do not appreciably affect mapping accuracy. The ascertainment used here prevents us from including cases with no trait-locus mutations.
Number of samples
Sampling data from more individuals will increase the number of recombination events and make for more complicated trees, increasing interval tree divergence. However, the amount of information per individual decreases logarithmically as more individuals are added. The reason for this decrease is that the coalescent histories of the added individuals are highly correlated with those already sampled, so as more samples are added, fewer and smaller branches are added to the known tree. As seen in Figure 5, additional tips have a significant effect if the initial map distance is large, but this effect disappears at more than about 40 samples, or 20 diploid individuals. A similar effect has been seen in a variety of coalescent-based analyses of population parameters, such as those by Pluzhnikov and Donnelly [1996] and Felsenstein [2006]. This is important, because the complexity of tree-space increases much faster than exponentially as we add samples, making analyses that rely on searching through possible trees take much longer to finish.
Computational limitations restricted our simulations to a maximum of 50 sampled individuals and a maximum map distance of 12.5 cM. Both of these values are unfortunately much lower than is typical for an association study. Fortunately, many of the trends we observe vary with the log of both the number of individuals and the map distance, so extrapolations of our simulations to more realistic cases are not unreasonable. A growing population generally yields less information from an equivalent sample size than a static population, but the information content per individual does not drop off as quickly as it does in a static population. In growing populations, therefore, roughly similar results should be obtainable by collecting data from more individuals.
If the analysis does rely on a thorough search of tree-space, but more data are readily available, one could theoretically use the new data in a replicate analysis and average the resulting probabilities. Averaging the results gives the new analysis the same weight as the first analysis, as if the analysis was merely repeated using the same data. This is conservative, but appropriate, since the two sets of data will be highly correlated, and share the majority of their respective ancestral histories.
Analyses that do not rely on searching tree-space, such as disequilibrium measures, may use all the data at their disposal, but the amount of information gained per individual decreases as more are added. If a locus is difficult to map, exponential increases in sample size will be needed to substantially improve results.
The results from our case/control analysis (Figure 7) are encouraging in this regard. As expected, traits with minority alleles at low frequency are shown to be harder to map than those with high frequency. However, causative alleles hiding in a wide range of initial map distances (0.125 to 5 cM) were all able to be discovered within a final map distance window of about 0.02 cM. Furthermore, from Figure 8 we see that even for the rarest minority allele studied (at 20%) these map distances were all less than 0.008 cM worse than they would have been had 60 more controls been added, enough to make the relative frequency of the alleles in the sample match that in the population. In effect, by collecting 20 cases and 20 controls, one reaps the benefit of collecting 100 samples.
Figure 5 cannot be compared directly to Figure 7, since the number of minority alleles was only constrained to be three or more in the former, but constrained to be exactly 20 in the latter, making the simulations in Figure 7 have more cases on average than those in Figure 5. (This is confirmed by the fact that the variance in the unconstrained simulations was greater than that in the case/control simulations.) As a result, Figure 7 shows systematically more accurate results than Figure 5. Using Figure 5 to extrapolate to ~100 samples, however, one can imagine the final map distances for 0.125 and greater converging even in the unconstrained case, as they do in Figure 7.
From this we conclude that the majority of the information in a recombinant coalescent tree remains even when data for many of the tips containing the majority allele are discarded (or never collected in the first place). It should be stressed, however, that this is the information in the true tree, and one’s ability to accurately reconstruct or estimate the true tree may be significantly compromised given sampled cases and controls instead of individuals sampled at random from the population. The task of accurate reconstruction of trees given cases and controls should therefore be of paramount importance to software and algorithm designers who wish to provide tools for genetic mapping.
Penetrance
When we observe phenotypes in diploid individuals, mapping traits becomes a more difficult proposition because the underlying genotype is often not known, and even if it is, when an individual is heterozygous one rarely knows how to properly resolve the haplotypes. One interesting conclusion from the data in Figure 9 is that data with heterozygote ambiguity (the codominant case) is nearly as informative as fully-phased data. Thus, if we are studying a codominant trait or can distinguish the heterozygotes from the homozygotes through family studies, the resulting estimate of the trait location has the potential to be almost as precise as if full haplotype information were known. The probable explanation is that it is unlikely for the true tree to contain an interval tree whose data likelihood is increased by having two tips simultaneously incorrectly placed. As a result, the correctly placed trait alleles dominate the data likelihood function.
However, traits with only two phenotypes lose a significant amount of signal. As Figure 9 illustrates, in the case of multiplicative penetrance when all genotypes can display both phenotypes, nearly all the information content is lost (92-99% for the 80:60:20 case). The contribution of partial penetrance of the heterozygote to this information loss can be seen in the middle cases in Figure 9, where the penetrance of the homozygotes was kept fixed at 100%. In these cases, an appreciable but less drastic amount of information was lost (25-62% in the 100:20:0 case). These losses are attributable to the fact that the more possible haplotype resolutions are available, the more likely it is that an incorrect resolution will happen to match an incorrect interval tree and receive a high data likelihood. In the multiplicative penetrance case, every tip of the tree might carry either of the two alleles (albeit with different probabilities), making an erroneous match much more likely, and more able to mask the correct fit.
If genotype resolution is impossible, collecting more data is the only remaining option, and modern searches for disease-causing alleles have expanded to collecting data from hundreds if not thousands of individuals. Unfortunately, our simulations are limited by the 2^N growth in computational complexity with increasing sample size, so it is difficult to extrapolate to these higher values from the available simulations. The information content in the partially-penetrant cases does approach that of the fully-penetrant case as the number of samples increases, perhaps indicating that at particularly high numbers of samples, some information lost due to incomplete penetrance might be regained. However, just as the search space for trees increases with the number of samples, the number of possible haplotype resolutions also increases, making a complete search clearly impossible and requiring the researcher to rely on sampled searches or summary statistics, both of which will again decrease information content.
In the cases of partial penetrance solely in the heterozygote, the best estimates came from the case where a heterozygote would display the uncommon phenotype 80% of the time. Precision steadily decreased as the simulated penetrance changed, until the worst estimates were seen when the heterozygote displayed the uncommon phenotype only 20% of the time. This corresponds to the decrease in the average number of individuals displaying the minority phenotype (though not in the number of minority haplotypes). This means that, given an unequal distribution of alleles, the more evenly split the phenotypes in the population, the more power there will be to map the corresponding trait alleles. Interestingly, this also means that precision does not depend on the uncertainty of heterozygote penetrance; if it did, the cases where the heterozygote was equally likely to display either phenotype would have been the worst.
In some cases, the penetrance model itself is not known. Unfortunately, a simulation study that tried to model this case would need to choose a prior over all reasonable penetrance models in order to properly combine the estimates from each. Such a prior has not been defined, and would almost certainly be controversial. Our research does show that it would be inadequate to simply average the fully-dominant and fully-recessive cases, hoping to include the intermediate partially-penetrant cases by proxy. As seen in Figure 9, both the dominant and recessive cases contain more information than several partially-penetrant cases, and estimates that ignored this would be inappropriately narrow.
Conclusions
While individual results will vary, fine mapping studies of any sort will on average be more effective for trait alleles unlikely to have arisen more than once, and for traits with alleles that are close to equally frequent in the population. When genotypes are known, enough information is present in 20 randomly-sampled diploid individuals that one must sample exponentially more individuals to significantly affect the limit of one's potential mapping precision. When genotypes are not known, sampling more individuals (and/or performing case/control studies) will help only insofar as they add information that would otherwise be contained in the genotype data. If the trait genotypes are differentially penetrant, much information can be recovered if the homozygotes and heterozygotes can be distinguished by pedigrees or additional studies. Further effort spent phasing the heterozygotes will not add appreciably to the total amount of information available. This result mirrors similar results from the COLDMAP program [Morris et al. 2004], which showed that phasing SNP data when mapping provided only minimal improvement, and from one case analyzed using DMLE+ [Reeve and Rannala 2002], where the addition of phase information for the DTD mutation that causes diastrophic dysplasia also did not appreciably narrow the confidence window. These other studies suggest that one's ability to reconstruct the true tree will also be largely unaffected by phase information.
This analysis also explains why, once the data was collected, the gene that causes Cystic Fibrosis (CFTR) was able to be mapped fairly straightforwardly [Rommens et al. 1989]. The disease is the most common fatal recessive single-gene disorder in people of European descent, with a mutant allele frequency of approximately 0.022 among Caucasians [Kerem et al. 1989]. The gene contains a coding region of approximately 6500 nucleotides, and with over 500 reported CFTR mutants, the θ for dysfunctional CFTR alleles is probably fairly high. However, over 70% of the dysfunctional alleles are the same 3-nucleotide deletion (ΔF508) [Kerem et al. 1989], which has an estimated age of at least 580 generations, and could possibly be much older [Wiuf 2001]. The high frequency and long history together ensure a larger number of relevant recombination events, as well as a higher number of cases in general. Because heterozygous individuals can be easily discovered through their affected offspring, this nearly mimics the codominant case, which we have seen to be the most informative of the tested cases.
As most simple genetic diseases like CF have been successfully mapped at this point, attention is turning to mapping complex diseases. Our results indicate that diseases in populations with a common disease allele with high penetrance should be more easily mapped than those with rare alleles with incomplete penetrance. Collecting data from more individuals can help, and while randomly sampling individuals would increase the precision only with the log of the number of individuals, performing case/control data collection will increase the effective sample size significantly. However, incomplete penetrance (or worse, having an unknown penetrance model) is likely to lose much of the information that might be present in the population, and this information cannot be regained through any mapping study.
To predict the precision of a nascent mapping effort, one must measure or estimate the trait θ and the distance to be mapped in centimorgans. For example, a study of a human disease might use a trait θ of 1.0 (from Pritchard’s estimation of human disease θ), and be attempting to map within a 5 cM region. From Figure 5, we would expect that the recombinant coalescent history of 20 or more diploid individuals would contain on average enough information to map the causative allele in the best-case scenario to slightly less than 0.1 cM. A case/control analysis could decrease this further (Figure 7), to around 0.02 cM, depending on the frequency of the trait allele in the population. Other factors could increase the range again: attempting to map a complex disease with incomplete penetrance, having incomplete or inconclusive surrounding data that make it difficult to accurately reconstruct the true tree, or even the simple bad luck of happening to choose a disease whose alleles shared a common heritage with a large portion of the surrounding genome.
In humans, 0.02 cM would be about 20 kilobases, which seems a bit long for an absolute minimum. Several factors may explain why studies have been seen to contain more information than this. First, while our simple case/control analysis showed this design to be quite effective at distilling information from the population at large, it may be that collecting even more cases and controls (into the hundreds, in many cases) could increase mapping efficiency faster than the slow logarithmic rate at which adding randomly-sampled individuals would. Luck combined with publication bias may explain more: 0.02 cM is just an average, and the variance is large (Figure 2). If only the more successful studies are published, mapping will appear to be more effective than it actually is. Pritchard's estimate of human disease θ (1.0) may also be high, or again, publication bias may have selected for mapping studies with a lower trait θ. Our assumption of equal forward and back mutation rates will also contribute to a wider estimate than is found in actual studies, since allowing back mutations can artificially increase the fit of the data to inappropriate interval trees. Recombination hot spots in real data are unlikely to increase mapping precision, although their presence could have systematically increased the variance of possible analyses and so increased the effect of the publication bias.
It is our hope that by knowing the odds before attempting a mapping project, researchers will be able to direct their efforts towards those studies with the highest chance of success, and collect the right kind of data for the projects they choose to embark on.
Acknowledgments
We thank Joe Felsenstein, other members of our lab, and two anonymous reviewers for helpful discussions and/or comments on the manuscript. This work was supported by grant GM 51929-10 to Mary K. Kuhner from the National Institutes of Health.
APPENDIX A
Fundamental to coalescent analysis is the ability to calculate the likelihood of observing a set of data given a particular coalescent tree. Calculating this ‘data likelihood’ for a set of unambiguous data using a ‘peeling’ algorithm has been described in Felsenstein [1981]. This is the likelihood that alleles evolving on the tree in question would have resulted in the given observed data.
The peeling algorithm involves iteratively considering each coalescent node on the tree, and calculating the likelihood that each possible allele at that node would give rise to the likelihoods already computed at the nodes above it (toward the tips). The 'tips' of the tree, which correspond to the present observations, have very simple likelihoods: if the observed allele is H, the likelihood of the observation if the true allele is H is 100%, and the likelihood of the observation if the true allele is h is 0%. We can use the vector [1, 0] to store this information at the tip, with the first position the likelihood of H and the second position the likelihood of h. These likelihoods are then used to calculate the likelihoods at each rootward node via the peeling algorithm. If there is no definitive observation of the data for a particular tip, the likelihood of that observation is 100% whether the true allele is H or h. Unknown data ('?') can therefore be represented by the vector [1, 1] at the tip.
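A compact sketch of this calculation for the symmetric two-state model (our own code; the nested-tuple tree format is assumed purely for illustration):

```python
import math

# Peeling (pruning) sketch: tips carry likelihood vectors [1, 0] for
# an observed H, [0, 1] for h, and [1, 1] for unknown data ('?').

def transition(branch_length):
    """2x2 transition matrix for the symmetric two-state model."""
    p_flip = 0.5 * (1.0 - math.exp(-2.0 * branch_length))
    return [[1 - p_flip, p_flip], [p_flip, 1 - p_flip]]

def peel(node):
    """node: ([lH, lh], None) at a tip, or (None, [(child, branch_len), ...]).
    Returns the conditional likelihood vector at this node."""
    vector, children = node
    if children is None:
        return vector
    result = [1.0, 1.0]
    for child, t in children:
        child_vec = peel(child)
        P = transition(t)
        for state in (0, 1):
            result[state] *= sum(P[state][s] * child_vec[s] for s in (0, 1))
    return result

# Data likelihood: root vector averaged over a uniform prior on the root state.
tip_H = ([1.0, 0.0], None)
tip_h = ([0.0, 1.0], None)
root = (None, [(tip_H, 0.1), (tip_h, 0.1)])
vec = peel(root)
print(0.5 * vec[0] + 0.5 * vec[1])
```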
In the case where an observation is ambiguous, such as an observed phenotype which might arise from different genotypes, we must consider all possible unambiguous haplotype resolutions that fit our observations. The data likelihood becomes:
(Eq A1) $P(O \mid G) = \sum_i P(O \mid D_i)\, P(D_i \mid G)$
where O is the set of observations, D_i is a particular haplotype arrangement, and G is the genealogy. (The general form P(x|y) is the probability of x given y.) In the case of an ambiguous phenotype, there is more than one possible haplotype arrangement, so each must be calculated and summed. If there are two or more individuals each displaying an ambiguous phenotype, each combination of haplotype arrangements must be considered; for example, if there are two heterozygotes with unknown phase, the four possible configurations of data would be HhHh, HhhH, hHHh, and hHhH.
Figure A1 illustrates one such case with two individuals, one displaying a dominant phenotype and the other displaying a recessive phenotype. Since there are three possible ways to arrange the data on the tips of the dominant individual, all three must be considered. However, two of these cases share the same data at one of the tips and differ only at the second. These two cases can be mathematically reduced to a single case, as shown on the right side of the figure, with a [1, 0] at the first tip and a [1, 1] at the second. The likelihood of observing the dominant phenotype in the first individual and the recessive phenotype in the second, given this particular tree, is therefore the sum of the data likelihoods of the two cases.
The situation becomes more semantically challenging when we consider penetrance, but remains mathematically simple. The penetrance of a genotype is used to determine P(O|D) from Equation A1. Mathematically, we can include this term in the peeling algorithm by assigning the penetrance of a genotype to either tree tip in our analysis. We can further use the 'collapsing' trick above to combine tips with different penetrance values, as illustrated in Figure A2.
If the homozygote hh had (say) a 10% penetrance for our phenotype, we could collapse this case with the hH case, and still maintain only two different resolutions. The first case would enumerate the probabilities when the first allele was an H, while the second case would enumerate the probabilities when the first allele was an h.
Following these methods, we can collapse all possible penetrance conditions with two alleles in a diploid individual to two cases. This can reduce our search time considerably: in a case with N diploid individuals displaying the dominant phenotype, we reduce the total number of cases from 3^N to 2^N.
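A sketch of the collapsing computation (our own code; allele 0 represents H and allele 1 represents h):

```python
# Collapsing sketch: for a diploid with an ambiguous phenotype, the
# possible haplotype resolutions reduce to two cases, one per choice
# of the first allele, with the second tip carrying penetrance-
# weighted partial likelihoods instead of being enumerated separately.

def collapsed_cases(penetrance):
    """penetrance: dict mapping genotype (a1, a2), alleles 0=H and 1=h,
    to P(observed phenotype | genotype). Returns two (tip1, tip2)
    vector pairs whose data likelihoods are then summed."""
    cases = []
    for first in (0, 1):
        tip1 = [1.0 if s == first else 0.0 for s in (0, 1)]
        # Second tip: weight each allele by the genotype's penetrance.
        tip2 = [penetrance.get((first, s), 0.0) for s in (0, 1)]
        cases.append((tip1, tip2))
    return cases

# Dominant phenotype, fully penetrant except hh showing it 10% of the time:
dom = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.1}
for tip1, tip2 in collapsed_cases(dom):
    print(tip1, tip2)
# [1, 0] [1.0, 1.0]   (first allele H: HH and Hh both fully penetrant)
# [0, 1] [1.0, 0.1]   (first allele h: hH penetrant, hh at 10%)
```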
APPENDIX B
Determining the average number of recombination events present in a recombining coalescent tree for a given set of parameters is not trivial. Unfortunately, one may not simply use the formula:
(Eq B1) $E[R] = r\,\theta\,(s - 1) \sum_{i=1}^{n-1} \frac{1}{i}$
where s is the number of sites, r is the recombination rate, θ is four times the effective population size times the mutation rate, and n is the number of tips in the tree. This formula fails to take two things into consideration. First, the addition of a recombination event to the tree, moving backwards in time, changes the probability of the next recombination event being added to the same tree, due to an increase in the number of lineages, each with fewer 'active sites' (sites with a present-day sampled descendant of the given lineage). Second, coalescent events between lineages with distant active sites create new lineages with a greatly increased number of potential locations for new recombination events.
Instead, we can consider the tree as a series of states with transitions between the states marked by recombination and coalescence. Going backward in time, we consider how the recombination rate and population size affect the rates between the different states. First, we calculate the expected time until all but one of the sites have coalesced using the formula:
(Eq B2) $T_i = dt + \sum_j P_{ji}\,T_j$, with $T_j = 0$ for the absorbing states
where Ti is the expected time we want, starting from state i, and Pij is the probability of changing to state i when starting in state j, or Prob(i|j) [Feller, 1950].
We can extend the equation for the expected time until the first coalescence to calculate the number of expected recombination events that will occur during that time with the equation:
(Eq B3) $R_i = \sum_j P_{ji}\left(a_{ij} + R_j\right)$
where Ri is the number of expected recombinations between state i and the final state, and aij is the number of recombinations encountered when moving from state i to state j.
For example, if we take the simplest case that has any recombination (two tips and two sites), we have three states with rates between and away from them, illustrated in the system of equilibria in Figure B1. In this system, the upper three states all have potential recombination events that make a difference to the tree, while the lower three states do not. Even if only one of the two sites has coalesced, further recombination will not affect the coalescence of the single site remaining.
This system of equilibria can be simplified by multiplying everything by 2N, as illustrated in Figure B2.
We can set this up as a matrix to give us a system of equations. The a_ij matrix is almost all zeros, with two exceptions (going from state 1 to 2, and from state 2 to 3), giving us:
(Eq B4) $R_1 = \bigl(1 - (1 + \rho)\,dt\bigr)\,R_1 + \rho\,dt\,(1 + R_2)$

(Eq B5) $R_2 = \bigl(1 - (3 + \tfrac{\rho}{2})\,dt\bigr)\,R_2 + dt\,R_1 + \tfrac{\rho}{2}\,dt\,(1 + R_3)$

(Eq B6) $R_3 = (1 - 6\,dt)\,R_3 + 4\,dt\,R_2$

where $\rho = 4NC$; transitions to the absorbing states, in which one or both sites have coalesced, contribute zero expected recombinations and are omitted.
In each equation, the R_i terms on both sides cancel, letting us further cancel the dt's and solve the system of equations for R_1:
(Eq B7) $R_1 = \dfrac{2\rho\,(2\rho + 9)}{\rho^2 + 13\rho + 18}$
In the case where 4NC equals one, R_1 is then 11/16. We confirmed this number with an implementation of Hudson's recombinant tree simulator [Hudson 1983], a final-coalescent-aware tree generator. The average number of recombinations present in 10,000 trees generated with 4NC = 1.0 (θ = 1.0 and r = 1.0) was 0.6862, well in line with the expected 0.6875.
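This check is easy to reproduce numerically; the sketch below (our own code) evaluates the closed form of Eq B7 and also simulates the embedded jump chain of the state diagram as reconstructed above:

```python
import random

# Numerical check of the two-tip, two-site result: the closed form
# R1 = 2*rho*(2*rho + 9) / (rho^2 + 13*rho + 18) with rho = 4NC,
# and a direct Monte Carlo of the state diagram (rates scaled by 2N).

def r1_closed_form(rho):
    return 2 * rho * (2 * rho + 9) / (rho**2 + 13 * rho + 18)

def r1_monte_carlo(rho, replicates=100_000):
    total = 0
    for _ in range(replicates):
        state, recs = 1, 0
        while state:
            if state == 1:      # two (ab) lineages: coalesce 1, recombine rho
                if random.random() < rho / (1 + rho):
                    recs, state = recs + 1, 2
                else:
                    state = 0                   # both sites coalesce at once
            elif state == 2:    # (ab, a, b): coalesce 3, recombine rho/2
                u = random.random() * (3 + rho / 2)
                if u < 1:
                    state = 1                   # a+b coalesce -> (ab, ab)
                elif u < 3:
                    state = 0                   # one site reaches its MRCA
                else:
                    recs, state = recs + 1, 3
            else:               # state 3, (a, a, b, b): 6 coalescent pairs
                if random.random() < 4 / 6:
                    state = 2                   # an a+b pair coalesces
                else:
                    state = 0                   # a+a or b+b: one site done
        total += recs
    return total / replicates

print(r1_closed_form(1.0))   # 0.6875 = 11/16
print(r1_monte_carlo(1.0))   # approximately 0.69
```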
Beyond this simplest case, however, the complexity increases rapidly. Rather than working out the analytical solution to the number of expected recombinations, values were estimated using 10,000 replicates of the Hudson simulator.
References
- Badano JL, Katsanis N. Beyond Mendel: an evolving view of human genetic disease transmission. Nat Rev Genet. 2002;3:779–789. doi: 10.1038/nrg910.
- Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet. 1999;22:231–238. doi: 10.1038/10290.
- Devlin B, Risch N. A comparison of linkage disequilibrium measures for fine scale mapping. Genomics. 1995;29:311–322. doi: 10.1006/geno.1995.9003.
- Durrant C, Zondervan KT, Cardon LR, Hunt S, Deloukas P, Morris AP. Linkage disequilibrium mapping via cladistic analysis of single-nucleotide polymorphism haplotypes. Am J Hum Genet. 2004;75:35–43. doi: 10.1086/422174.
- Eyre-Walker A, Keightley PD. High genomic deleterious mutation rates in hominids. Nature. 1999;397:344–347. doi: 10.1038/16915.
- Feller W. An Introduction to Probability Theory and Its Applications, Volume 1. John Wiley & Sons; New York: 1950.
- Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol. 1981;17:368–376. doi: 10.1007/BF01734359.
- Felsenstein J. PHYLIP (Phylogeny Inference Package) version 3.6. Department of Genome Sciences, University of Washington; Seattle: 2005. Distributed by the author.
- Felsenstein J. Accuracy of coalescent likelihood estimates: Do we need more sites, more sequences, or more loci? Mol Biol Evol. 2006;23(3):691–700. doi: 10.1093/molbev/msj079.
- Felsenstein J. A Dismal Theorem for Evolutionary Genetics? In preparation.
- Hudson RR. Properties of the neutral allele model with intergenic recombination. Theor Popul Biol. 1983;23:183–201. doi: 10.1016/0040-5809(83)90013-8.
- Kerem BS, Rommens JM, Buchanan JA, Markiewicz D, Cox TK, et al. Identification of the cystic fibrosis gene: genetic analysis. Science. 1989;245:1073–1080. doi: 10.1126/science.2570460.
- Kishino H, Hasegawa M. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J Mol Evol. 1989;29:170–179. doi: 10.1007/BF02100115.
- Kuhner MK. LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters. Bioinformatics. 2006;22(6):768–770. doi: 10.1093/bioinformatics/btk051.
- Kuhner MK, Beerli P, Yamato J, Felsenstein J. Usefulness of single nucleotide polymorphism data for estimating population parameters. Genetics. 2000;156:439–447. doi: 10.1093/genetics/156.1.439.
- Kuhner MK, Yamato J, Felsenstein J. Maximum likelihood estimation of recombination rates from population data. Genetics. 2000;156:1393–1401. doi: 10.1093/genetics/156.3.1393.
- Li W-H, Sadler LA. Low nucleotide diversity in man. Genetics. 1991;129:513–523. doi: 10.1093/genetics/129.2.513.
- Liu JS, Sabatti C, Teng J, Keats BJ, Risch N. Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Res. 2001;11:1716–1724. doi: 10.1101/gr.194801.
- Mailund T, Schierup MH, Pedersen CNS, Madsen JN, Hein J, Schauser L. GeneRecon--a coalescent based tool for fine-scale association mapping. Bioinformatics. 2006. doi: 10.1093/bioinformatics/btl153.
- McPeek M, Strahs A. Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping. Am J Hum Genet. 1999;65:858–875. doi: 10.1086/302537.
- Minichiello MJ, Durbin R. Mapping trait loci by use of inferred ancestral recombination graphs. Am J Hum Genet. 2006;79:910–922. doi: 10.1086/508901.
- Molitor J, Marjoram P, Conti D, Stram D, Thomas D. A survey of current Bayesian gene mapping methods. Human Genomics. 2004;1:371–374. doi: 10.1186/1479-7364-1-5-371.
- Morris AP, Whittaker JC, Balding DJ. Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. Am J Hum Genet. 2002;70:686–707. doi: 10.1086/339271.
- Morris AP, Whittaker JC, Balding DJ. Bayesian fine-scale mapping of disease loci, by hidden Markov models. Am J Hum Genet. 2000;67:155–169. doi: 10.1086/302956.
- Morris AP, Whittaker JC, Balding DJ. Little loss of information due to unknown phase for fine-scale linkage-disequilibrium mapping with single-nucleotide polymorphism genotype data. Am J Hum Genet. 2004;74:945–953. doi: 10.1086/420773.
- Nordborg M, Tavaré S. Linkage disequilibrium: what history has to tell us. Trends Genet. 2002;18:83–90. doi: 10.1016/s0168-9525(02)02557-x.
- Pluzhnikov A, Donnelly P. Optimal sequencing strategies for surveying molecular genetic diversity. Genetics. 1996;144:1247–1262. doi: 10.1093/genetics/144.3.1247.
- Pritchard JK. Are rare variants responsible for susceptibility to common diseases? Am J Hum Genet. 2001;69:124–137. doi: 10.1086/321272.
- Rannala B, Reeve JP. High-resolution multipoint linkage disequilibrium mapping in the context of a human genome sequence. Am J Hum Genet. 2001;69:159–178. doi: 10.1086/321279.
- Reeve JP, Rannala B. DMLE+: Bayesian linkage disequilibrium gene mapping. Bioinformatics. 2002;18:894–895. doi: 10.1093/bioinformatics/18.6.894.
- Rommens JM, Iannuzzi MC, Kerem B, Drumm ML, Melmer G, et al. Identification of the cystic fibrosis gene: chromosome walking and jumping. Science. 1989;245(4922):1059–1065. doi: 10.1126/science.2772657.
- Stephens M, Donnelly P. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet. 2003;73:1162–1169. doi: 10.1086/379378.
- Stoneking M, et al. Alu insertion polymorphisms and human evolution: evidence for a larger population size in Africa. Genome Res. 1997;7:1061–1071. doi: 10.1101/gr.7.11.1061.
- Toivonen HT, Onkamo P, Vasko K, Ollikainen V, Sevon P, et al. Data mining applied to linkage disequilibrium mapping. Am J Hum Genet. 2000;67:133–145. doi: 10.1086/302954.
- Weiss KM, Terwilliger JD. How many diseases does it take to map a gene with SNPs? Nat Genet. 2000;26:151–157. doi: 10.1038/79866.
- Wiuf C. Do ΔF508 heterozygotes have a selective advantage? Genet Res. 2001;78:41–47. doi: 10.1017/s0016672301005195.
- Zöllner S, Pritchard JK. Coalescent-based association mapping and fine mapping of complex trait loci. Genetics. 2005;169:1071–1092. doi: 10.1534/genetics.104.031799.