Briefings in Bioinformatics. 2016 Mar 10;18(2):195–204. doi: 10.1093/bib/bbw006

Multi-marker linkage disequilibrium mapping of quantitative trait loci

Soyoun Lee 1, Jie Yang 2, Jiayu Huang 1, Hao Chen 1, Wei Hou 2, Song Wu 1
PMCID: PMC5862291  PMID: 26966282

Abstract

Single nucleotide polymorphisms (SNPs), the most common genetic markers in genome-wide association studies, are usually in linkage disequilibrium (LD) with each other within a small genomic region. Both single- and two-marker-based LD mapping methods have been developed by taking advantage of the LD structures. In this study, a more general LD mapping framework with an arbitrary number of markers has been developed to further improve LD mapping and its detection power. This method is referred to as multi-marker linkage disequilibrium mapping (mmLD). For the parameter estimation, we implemented a two-phase estimation procedure: first, haplotype frequencies were estimated for known markers; then, haplotype frequencies were updated to include the unknown quantitative trait loci based on estimates from the first step. For the hypothesis testing, we proposed a novel sequential likelihood ratio test procedure, which iteratively removed haplotypes with zero frequency and subsequently determined the proper degrees of freedom. To compare the proposed mmLD method with other existing mapping methods, e.g. the adjusted single-marker LD mapping and the SKAT_C, we performed extensive simulations under various scenarios. The simulation results demonstrated that the mmLD has the same or higher power than the existing methods, while maintaining the correct type I error. We further applied the mmLD to a public data set, ‘GAW17’, to investigate its applicability. The results showed good performance of the mmLD. We concluded that this improved mmLD method will be useful for future genome-wide association studies and genetic association analyses.

Keywords: linkage disequilibrium, genetic mapping, genome-wide association studies, quantitative trait loci, multi-marker LD, sequential testing

Introduction

Recent genetic studies have shown that many biologically and clinically important traits, such as complex diseases, are likely affected by complicated genetic architectures [1]. That is, multiple genes are involved in the development of these traits, exerting their influences singly or interactively with others [2–4]. Continuous traits are quantifiable and are therefore referred to as quantitative traits; their controlling genes are called quantitative trait loci (QTL). To detect the association between these traits and their QTL, many useful methods have been developed [2, 5–9]. While most of these traditional methods are based on tests constructed on single markers, more recent genome-wide association studies (GWASs) have focused on tests that are based on multiple markers, such as regression-based methods [10, 11] or single nucleotide polymorphism (SNP) selection with penalty approaches [12].

Based on the most recent dbSNP database (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi), it is estimated that there are about 97 million SNPs in the human genome. Although it is possible to survey these genetic variations in whole genomes through high-throughput sequencing techniques, the cost is still too high for large-scale studies. So presently, the most popular platform for SNP studies concerning thousands of subjects is still the SNP array, which samples about 1 million SNPs. This seems to represent only a small percentage of total SNPs; however, well-designed arrays are able to conserve a large percentage of genomic information. The reason is that SNPs in the genome are not independent of each other; rather, they are locally linked and form the so-called linkage disequilibrium (LD) blocks. This phenomenon of LD describes the deviation of the observed frequency of a haplotype from random association [1]. Owing to this correlated structure, the sampled SNPs carry information about the unsampled ones and represent their presence in the genome [2]. The LD structures arise because, when genetic markers are passed from generation to generation, markers that are close to each other tend to be inherited together (unless disrupted by chromosomal crossover), which consequently makes markers, or SNPs, correlated with each other in a neighborhood.

Based on the LD structures, a group of methods that directly incorporate the LD information into genetic mapping have been developed, and they are therefore referred to as LD mapping [2–6, 8, 13, 14]. Both single- and two-marker LD mapping have been established, and the two-marker LD mapping showed considerably higher power than single-marker LD mapping [2]. Logically, LD mapping with multiple markers ($\geq 3$) is expected to be even more powerful than the two-marker method, as more markers may carry more information about the QTL and subsequently support better inferences. However, multi-marker LD mapping (mmLD) has not been developed yet, owing to several challenges when extending the mapping from two to more markers, e.g. how to properly determine the degrees of freedom of the test. Also, as more markers are included in the model, the marginal effects of additional markers would diminish gradually; therefore, it is also of interest to know how many markers would be sufficient for the efficient implementation of mmLD. In this study, we address these issues and propose an LD mapping framework suitable for an arbitrary number of SNPs.

The rest of the article is organized as follows: we first describe the general framework of the mmLD, with detailed procedures on parameter estimation and hypothesis testing; then the performance of mmLD will be evaluated through extensive simulation studies under various scenarios, and compared with other existing methods; lastly, we will demonstrate the application of the mmLD using a real-data example.

Methods

General setting for the mmLD

Let us assume a univariate phenotypic trait, which is controlled by a causal QTL with two alleles ($Q$ and $q$), and the allele frequencies of $Q$ and $q$ are expressed as $p_Q$ and $p_q = 1 - p_Q$. Suppose that the QTL is genetically linked to a group of SNP markers, $M_\ell$ ($\ell = 1, \ldots, k$, $k > 2$), forming an LD block in which each marker $M_\ell$ has two alleles, $M_\ell$ and $m_\ell$, with corresponding frequencies of $p_\ell$ and $1 - p_\ell$. Then, the $k$ markers may form $2^k$ possible haplotypes, or $2^{k+1}$ possible joint haplotypes together with the QTL. Let $p_{M_1 \cdots M_k}$ denote the frequency of haplotypes formed by the $k$ markers, and $p_{M_1 \cdots M_k, Q}$ the frequency of the joint haplotype formed by the $k$ markers and the QTL.

The LD between the QTL and the $k$ markers, i.e. the deviation of the observed frequency of the joint haplotype from random association of the QTL and the markers, is specified by the following equation:

$$D_{M_1 \cdots M_k, Q} = p_{M_1 \cdots M_k, Q} - p_{M_1 \cdots M_k}\, p_Q \qquad (1)$$

In general, there are $2^k - 1$ free $D_{M_1 \cdots M_k, Q}$ values: one can be written for each of the $2^k$ marker haplotypes, but they sum to zero. If all $D_{M_1 \cdots M_k, Q}$ are zero, the QTL and the marker group are independent, and therefore there is no QTL effect for the tested marker group; otherwise, the QTL and the marker group are in LD, indicating that a QTL may exist in the neighborhood of the marker group.
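Equation (1) can be illustrated with a short sketch. The function name `ld_coefficients` and the dictionary layout are illustrative assumptions, not part of the mmLD software; the frequencies are the Case 1 values listed in Table 3 of this article.

```python
def ld_coefficients(joint):
    """Return D_{h,Q} = p_{h,Q} - p_h * p_Q for every marker haplotype h.

    joint: dict mapping (marker_haplotype, qtl_allele) -> frequency,
           with qtl_allele equal to "Q" or "q" (hypothetical layout).
    """
    # Marginal frequency of the Q allele.
    p_Q = sum(f for (_, allele), f in joint.items() if allele == "Q")
    # Marginal frequency of each marker haplotype.
    p_hap = {}
    for (hap, _), f in joint.items():
        p_hap[hap] = p_hap.get(hap, 0.0) + f
    return {hap: joint.get((hap, "Q"), 0.0) - p * p_Q for hap, p in p_hap.items()}


# Joint haplotype frequencies from Table 3, Case 1.
table3_case1 = {
    ("000", "Q"): 0.0,  ("000", "q"): 0.16,
    ("001", "Q"): 0.13, ("001", "q"): 0.04,
    ("010", "Q"): 0.0,  ("010", "q"): 0.0,
    ("011", "Q"): 0.1,  ("011", "q"): 0.0,
    ("100", "Q"): 0.13, ("100", "q"): 0.2,
    ("101", "Q"): 0.0,  ("101", "q"): 0.0,
    ("110", "Q"): 0.08, ("110", "q"): 0.08,
    ("111", "Q"): 0.0,  ("111", "q"): 0.08,
}
D = ld_coefficients(table3_case1)
```

Marker haplotypes with zero frequency (here 010 and 101) necessarily give $D = 0$, and the $D$ values sum to zero across haplotypes.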

Likelihood function of the mmLD

Suppose a random sample of size $n$ is drawn from a natural population under Hardy–Weinberg Equilibrium (HWE) [15]. For subject $i$ ($i = 1, \ldots, n$), $k$ markers ($M_{i1} \cdots M_{ik}$) have been genotyped and a continuous phenotypic trait ($y_i$) has also been obtained. Assume the continuous trait is directly affected by the QTL. Then, the relationship between the observed phenotypes and their expected means, as determined by the QTL genotypes, can be described by the following mixture model:

$$y_i = \sum_{j=0}^{2} \xi_{ij}\, \mu_j + e_i, \qquad i = 1, \ldots, n \qquad (2)$$

where $\xi_{ij}$ is an indicator variable defined as 1 if subject $i$ has QTL genotype $j$ (2 for $QQ$, 1 for $Qq$ and 0 for $qq$) and 0 otherwise, $\mu_j$ is the expected phenotypic mean for QTL genotype $j$, and $e_i$ is the error term, which is assumed to follow a Gaussian distribution with zero mean and variance $\sigma^2$. Let $\pi_{j|i} = P(Q = j \mid M_{i1} \cdots M_{ik})$, i.e. $P(\xi_{ij} = 1)$, denote the conditional probability of subject $i$ carrying QTL genotype $j$ given its marker information. Then, the likelihood function based on the phenotype and multiple markers can be constructed as follows:

$$L(\Theta_p, \Theta_q; y, M_1, \ldots, M_k) = \prod_{i=1}^{n} \left[ \sum_{j=0}^{2} \pi_{j|i}\, f_j(y_i; \Theta_q) \right] = \prod_{i=1}^{n} \left[ \sum_{j=0}^{2} \pi_{j|i}\, \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \mu_j)^2}{2\sigma^2} \right) \right] \qquad (3)$$

where $\Theta_p$ is a vector of the population genetic parameters that describe haplotype frequencies including the $k$ known markers and one putative QTL, and $\Theta_q = \{\mu_0, \mu_1, \mu_2, \sigma^2\}$ is a vector of parameters for the phenotypic trait. The calculation of $\pi_{j|i}$ becomes more complicated as $k$, the number of known markers, increases, and this issue is addressed in the section below.

Calculation of joint and conditional genotypic probabilities with k markers

For small $k$ (e.g. $k = 1$ or 2), the conditional probabilities of the QTL given specific marker genotypes can be calculated from their corresponding haplotype frequencies [2]. However, as the numbers of haplotypes ($2^k$) and genotypes ($3^k$) increase exponentially with the number of genetic loci ($k$), the calculation of genotypic probabilities becomes much more difficult for a large $k$. Here we describe a general algorithm for the calculation of genotypic probabilities for the $k$ markers, which then can be used for constructing conditional probabilities of the QTL given markers.

In general, genotypes can be categorized into two classes: homozygote or heterozygote. If a subject is homozygous at all genetic loci, its joint genotypic probability can be simply expressed as the square of its underlying haplotype frequency. However, if at least one of the loci is heterozygous, the expression becomes more complicated, because one genotype may arise from several different haplotype pairs, and its probability is the sum, over those pairs, of two times the product of the two haplotype frequencies. For simplicity, the two alleles of a specific locus are denoted as 0 for the major allele and 1 for the minor allele, and the corresponding three genotypes are denoted as 0, 1 and 2, simply as the sum of the two alleles. We illustrate the algorithm using an example of five loci ($k = 5$) in the following steps:

Step 1: Determine zygotic status (homozygote versus heterozygote) for genotypes at k loci.

Step 2: If genotypes of all $k$ markers are homozygotes, i.e. genotypes of 0 or 2, their probabilities are then squares of the corresponding haplotype frequencies. For example, for genotypes such as $G_{00202}$ and $G_{02002}$, the genotypic probabilities can be simply calculated as $P(G_{00202}) = p_{00101}^2$ and $P(G_{02002}) = p_{01001}^2$, where $G_{M_1 \cdots M_k}$ denotes the genotype of the $k$ markers and $p_{00101}$ and $p_{01001}$ are the haplotype frequencies.

Step 3: If there is at least one heterozygote locus, then

Step 3.1: for homozygote loci (genotype of 0 or 2), their allelic representations in the two haplotypes are both 0s or both 1s;

Step 3.2: for heterozygote loci (genotype of 1), their allelic representations in the two haplotypes are 0 and 1, which can be swapped depending on the parental origin;

Step 3.3: the overall genotypic probability is then the sum of two times the product of corresponding haplotypic frequencies.

For example, the genotype AaBbccDDee ($G_{11202}$) with five loci consists of two different haplotype combinations, (00101|11101) and (10101|01101), and the haplotypes in each combination may have two different parental origins. So the probability of AaBbccDDee ($G_{11202}$) can be expressed as $P(\text{AaBbccDDee}) = 2 p_{00101} p_{11101} + 2 p_{10101} p_{01101}$.
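Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration under HWE, not the authors' implementation; the function names `genotype_prob` and `qtl_conditional`, the string encoding of haplotypes and the numeric frequencies are all assumptions made for the example. Enumerating ordered haplotype pairs makes the factor of two for each heterozygous combination appear automatically.

```python
from itertools import product

# Ordered (first haplotype, second haplotype) allele pairs compatible with
# each single-locus genotype: Steps 3.1 and 3.2.
ALLELE_PAIRS = {0: [(0, 0)], 1: [(0, 1), (1, 0)], 2: [(1, 1)]}


def genotype_prob(genotype, hap_freq):
    """P(multilocus genotype) under HWE.

    genotype: tuple of 0/1/2, the count of allele 1 at each locus
    hap_freq: dict mapping haplotype strings such as "00101" to frequencies
    """
    total = 0.0
    for pairs in product(*(ALLELE_PAIRS[g] for g in genotype)):
        h1 = "".join(str(a) for a, _ in pairs)
        h2 = "".join(str(b) for _, b in pairs)
        total += hap_freq.get(h1, 0.0) * hap_freq.get(h2, 0.0)
    return total


def qtl_conditional(marker_genotype, joint_hap_freq):
    """P(QTL genotype j | marker genotype) for j = 0, 1, 2.

    joint_hap_freq uses haplotype strings with the QTL allele appended as
    the last character (1 for Q, 0 for q); this layout is an assumption.
    Returns None when the marker genotype has probability zero.
    """
    joint = [genotype_prob(tuple(marker_genotype) + (j,), joint_hap_freq)
             for j in (0, 1, 2)]
    marginal = sum(joint)
    return [x / marginal for x in joint] if marginal > 0 else None


# Hypothetical frequencies for the four haplotypes in the worked example.
toy = {"00101": 0.1, "11101": 0.2, "10101": 0.3, "01101": 0.4}
p_het = genotype_prob((1, 1, 2, 0, 2), toy)   # 2*0.1*0.2 + 2*0.3*0.4
p_hom = genotype_prob((0, 0, 2, 0, 2), toy)   # 0.1 squared
```

The cost grows as $2^m$ for a genotype with $m$ heterozygous loci, which is acceptable for the small windows (up to about five markers) considered in this article.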

Parameter estimation

To obtain the maximum likelihood estimates (MLEs) of the parameters in the likelihood function (3), we use the Expectation-Maximization (EM) algorithm, which is efficient for mixture models. In the setting of the mmLD, we propose a two-phase algorithm: the first phase estimates haplotype frequencies based on genotypes of the $k$ known markers, and the second phase estimates the joint haplotype frequencies of the QTL and the $k$ known markers, together with the phenotypic parameters. As algorithms for estimating haplotype frequencies have been intensively studied, for our purpose we simply adopted an available algorithm, called ‘haplo-stat’. ‘haplo-stat’ is also an EM-based method, which was proposed by Excoffier and Slatkin [16–18].

Overall, the detailed algorithmic workflow for the parameter estimation in mmLD is described as follows:

Step 1: Estimate the $2^k$ haplotype frequencies for the $k$ known markers from the given genotype data. The R-package ‘haplo.stat’ is applied here [16].

Step 2: Set initial values $\Theta^{(0)} = (\Theta_p^{(0)}, \Theta_q^{(0)})$: $\Theta_p^{(0)}$ for the $2^{k+1}$ haplotypic frequencies of the $k$ known markers and one QTL, and $\Theta_q^{(0)}$ for the phenotypic parameters. The elements of $\Theta_p^{(0)}$ are initialized to half of the corresponding haplotypic frequency of the $k$ known markers from Step 1.

Step 3: Derive the conditional joint genotypic probabilities from the haplotype frequencies based on the algorithm illustrated previously.

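Step 4: E-step in the EM algorithm for solving the MLE: The data are first augmented with variables $z_{ij}$, indicator functions that are equal to 1 if subject $i$ carries QTL genotype $j$, and 0 otherwise. Then, for the $r$th iteration, the complete log-likelihood can be expressed as:

$$\log L = \sum_{i=1}^{n} \sum_{j=0}^{2} \left[ z_{ij} \log\left( \pi_{j|i}^{(r)} \right) + z_{ij} \log\left( \frac{1}{\sqrt{2\pi \sigma^{(r)2}}} \right) - z_{ij}\, \frac{\left( y_i - \mu_j^{(r)} \right)^2}{2 \sigma^{(r)2}} \right] \qquad (4)$$

The posterior probability of subject $i$ carrying a particular QTL genotype $j$ is:

$$E(z_{ij}) = \Pi_{j|i}^{(r)} = \frac{\pi_{j|i}^{(r-1)}\, f_j\left( y_i \mid \Theta_q^{(r-1)} \right)}{\sum_{j'=0}^{2} \pi_{j'|i}^{(r-1)}\, f_{j'}\left( y_i \mid \Theta_q^{(r-1)} \right)} \qquad (5)$$

for $i = 1, 2, \ldots, n$, and $j = 0$, 1 or 2.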

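Step 5: M-step in the EM algorithm for solving the MLE: In this step, we need to maximize the expectation of the complete log-likelihood $\ell$:

$$\begin{aligned}
E(\ell) &= E\left( \sum_{i=1}^{n} \sum_{j=0}^{2} \left[ z_{ij} \log(\pi_{j|i}) + z_{ij} \log\left( \frac{1}{\sqrt{2\pi\sigma^2}} \right) - z_{ij}\, \frac{(y_i - \mu_j)^2}{2\sigma^2} \right] \right) \\
&= \sum_{i=1}^{n} \sum_{j=0}^{2} \Pi_{j|i}^{(r+1)} \log(\pi_{j|i}) + \sum_{i=1}^{n} \sum_{j=0}^{2} \Pi_{j|i}^{(r+1)} \left[ \log\left( \frac{1}{\sqrt{2\pi\sigma^2}} \right) - \frac{(y_i - \mu_j)^2}{2\sigma^2} \right] \\
&= E_{z_{ij}}\left( \ell(\Theta_p); y, \Theta_p^{(r)}, \Theta_q^{(r)} \right) + E_{z_{ij}}\left( \ell(\Theta_q); y, \Theta_p^{(r)}, \Theta_q^{(r)} \right)
\end{aligned} \qquad (6)$$

As shown in (6), the function can be maximized separately with respect to the phenotypic and haplotypic parameters.

M1. For the estimation of the phenotypic parameters ($\Theta_q$), the updates can be derived from the estimating equations and are in closed form as given below:

$$\hat{\mu}_j^{(r+1)} = \frac{\sum_{i=1}^{n} \Pi_{j|i}^{(r+1)}\, y_i}{\sum_{i=1}^{n} \Pi_{j|i}^{(r+1)}} \qquad \text{for } j = 0, 1 \text{ or } 2 \qquad (7)$$

$$\hat{\sigma}^{2(r+1)} = \frac{\sum_{i=1}^{n} \sum_{j=0}^{2} \Pi_{j|i}^{(r+1)} \left( y_i - \hat{\mu}_j^{(r+1)} \right)^2}{\sum_{i=1}^{n} \sum_{j=0}^{2} \Pi_{j|i}^{(r+1)}} \qquad (8)$$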

M2. For the estimation of the haplotypic parameters ($\Theta_p$), another inner layer of the EM algorithm can be used. In this step, we need to infer the haplotype frequencies of the $k$ known markers and the QTL from their expected genotypic probabilities. Again, we adopted an available program, EH (Estimating Haplotype), for this purpose [17].

Step 6: Iterate Steps 4–5 until the log-likelihood function converges.
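The phenotypic part of this workflow (the E-step of equation (5) and the M1 updates of equations (7) and (8)) can be sketched as follows. This is a simplified illustration, not the mmLD implementation: the conditional probabilities $\pi_{j|i}$ are held fixed, so the inner haplotype update of step M2 (done with EH in the article) is deliberately left out, and the names `normal_pdf` and `em_phenotype` are hypothetical.

```python
import math


def normal_pdf(y, mu, var):
    """Gaussian density with mean mu and variance var."""
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_phenotype(y, prior, mu, var, n_iter=200, tol=1e-8):
    """EM updates for mu_0, mu_1, mu_2 and sigma^2 with pi_{j|i} held fixed.

    y:     list of phenotypes
    prior: list of [pi_0, pi_1, pi_2] per subject (all assumed positive)
    mu:    list of three starting means; var: starting variance
    Returns (mu, var, log-likelihood at the last E-step).
    """
    old_ll = -math.inf
    ll = old_ll
    for _ in range(n_iter):
        # E-step, equation (5): posterior QTL genotype probabilities.
        post, ll = [], 0.0
        for yi, pri in zip(y, prior):
            w = [pri[j] * normal_pdf(yi, mu[j], var) for j in range(3)]
            s = sum(w)
            ll += math.log(s)
            post.append([wj / s for wj in w])
        # M-step, equation (7): posterior-weighted means.
        mu = [sum(p[j] * yi for p, yi in zip(post, y)) / sum(p[j] for p in post)
              for j in range(3)]
        # M-step, equation (8): the denominator equals n because each
        # subject's posteriors sum to one.
        var = sum(p[j] * (yi - mu[j]) ** 2
                  for p, yi in zip(post, y) for j in range(3)) / len(y)
        if ll - old_ll < tol:
            break
        old_ll = ll
    return mu, var, ll
```

In a full implementation, each pass would also refresh $\pi_{j|i}$ from the updated joint haplotype frequencies before the next E-step.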

Hypothesis testing

The overall goal here is to test whether any unknown QTL is in LD with a group of $k$ known markers. Significant LD implies that the $k$ known markers and the unknown QTL are physically close, which could provide guidance for subsequent biological validation. The hypotheses for the mmLD mapping can be formulated as follows:

H0: The QTL and the known markers are independent, i.e. all $D_{M_1 \cdots M_k, Q} = 0$

H1: At least one of the equalities in H0 does not hold.

Under the null hypothesis (H0), the conditional genotypic probabilities of the QTL are constant across subjects regardless of the marker genotypes they carry. So the parameter set under H0 is $\Theta_{H_0} = \{p_1, \ldots, p_k, D_{12}, \ldots, D_{1 \cdots k}, p_Q, \mu_0, \mu_1, \mu_2, \sigma^2\}$, where $p_1, \ldots, p_k$ indicate the allelic frequencies of the $k$ known markers, $D_{12}, \ldots, D_{1 \cdots k}$ denote the LDs among the $k$ known markers, $\mu_0, \mu_1, \mu_2, \sigma^2$ denote the three means and the variance of the phenotype, and $p_Q$ represents the allelic frequency of the QTL. The parameters under the alternative hypothesis (H1) include an additional set of parameters ($D_{1,Q}, \ldots, D_{12 \cdots k,Q}$), which are the LDs between the $k$ known markers and the QTL. The EM algorithm to maximize the likelihood under H0 is similar to that under H1 in the previous section. Then, a likelihood ratio test (LRT) statistic can be constructed as follows:

$$\mathrm{LRT} = -2\left( \ell_{H_0} - \ell_{H_1} \right) \qquad (9)$$

where $\ell_{H_0}$ and $\ell_{H_1}$ are the maximized log-likelihoods under H0 and H1, respectively. Under H0, the LRT asymptotically follows a $\chi^2$ distribution with degrees of freedom (d.f.) equal to the difference between the numbers of parameters under H0 and H1.
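As a small numerical illustration of the test statistic in equation (9), the sketch below turns two maximized log-likelihoods into an LRT statistic and a p-value. The function names and the example log-likelihood values are hypothetical. To stay within the Python standard library, the chi-square tail probability uses the standard finite-series expressions for integer d.f.; in practice a statistics library would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even d.f.: exp(-x/2) * sum_{i=0}^{df/2-1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd d.f.: erfc(sqrt(x/2)) plus exp(-x/2) * sum_i (x/2)^(i-1/2) / Gamma(i+1/2).
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5) * math.exp(-half)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (i + 0.5)
    return total


def lrt_pvalue(loglik_h0, loglik_h1, df):
    """LRT statistic -2 * (l_H0 - l_H1) and its asymptotic chi-square p-value."""
    stat = -2.0 * (loglik_h0 - loglik_h1)
    return stat, chi2_sf(stat, df)


# Hypothetical maximized log-likelihoods, tested with 2 d.f.
stat, p_value = lrt_pvalue(-100.0, -97.0, 2)
```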

Estimating the degrees of freedom

If all haplotypes of the $k$ known markers exist and are estimable from the sampled data, then the d.f. of the LRT should be $2^k - 1$, i.e. the number of $D_{M_1 \cdots M_k, Q}$ values. For example, when the number of known markers is 2, the d.f. would be $2^2 - 1 = 3$. However, in practice, we find that when $k$ becomes large, the genotypic data fragment quickly and some haplotypes may have zero frequency. Note that if $p_{M_1 \cdots M_k} = 0$, then $p_{M_1 \cdots M_k, Q} = 0$, and according to equation (1), $D_{M_1 \cdots M_k, Q} = 0$ as well. Consequently, the LDs between zero-frequency haplotypes and the QTL are determined and are no longer unknown parameters, so they should not be counted toward the d.f. Therefore, if some haplotypes have zero frequency (say $\lambda$ of them, $\lambda > 0$), then the proper d.f. should be $2^k - \lambda - 1$.

Another practical issue concerns haplotypes with very small estimated frequencies (e.g. 0.0001). Our simulation studies below showed that if they were always treated as zeros, which reduces the d.f., the overall type I error would be inflated; on the other hand, if they were always treated as non-zero quantities, the test would be too conservative. To solve this issue, we propose a novel sequential LRT procedure, in which haplotypes with small frequencies are tested one at a time, starting from the smallest non-zero haplotype, to examine whether they are significantly different from zero. The full model assumes that the small quantity is an effective non-zero haplotype frequency, while the reduced model regards it as zero. The deviance between their likelihoods is tested by an LRT with 1 d.f. The procedure is repeated, one haplotype at a time in increasing order of frequency, until none of the remaining haplotype frequencies can be set to zero. The d.f. is then determined by the number of haplotypes retained with non-zero frequencies (that number minus one, consistent with the d.f. formula above). Note that in this way, the d.f. for each sample is not pre-fixed but data adaptive.
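One possible reading of the sequential procedure is sketched below. It is not the authors' code: `sequential_df` and the callable `fit_loglik`, which is assumed to return the maximized log-likelihood when a given set of haplotypes is constrained to zero frequency, are hypothetical, and stopping at the first significant haplotype is an interpretation of the description above. The toy log-likelihood at the end exists only to make the example runnable; its penalty of 4000 times the frequency is arbitrary. The frequencies are the estimates in Table 1.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 d.f."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))


def sequential_df(hap_freqs, fit_loglik, alpha=0.05):
    """Return (d.f., set of haplotypes treated as zero).

    hap_freqs:  dict mapping haplotype label -> estimated frequency
    fit_loglik: hypothetical callable taking a frozenset of haplotypes forced
                to zero and returning the maximized log-likelihood
    """
    zeros = {h for h, p in hap_freqs.items() if p == 0.0}
    # Candidates ordered from the smallest non-zero frequency upward.
    candidates = sorted((h for h in hap_freqs if h not in zeros),
                        key=lambda h: hap_freqs[h])
    ll_full = fit_loglik(frozenset(zeros))
    for h in candidates:
        ll_reduced = fit_loglik(frozenset(zeros | {h}))
        stat = 2.0 * (ll_full - ll_reduced)
        if chi2_sf_1df(stat) < alpha:
            break  # h differs significantly from zero, so stop here
        zeros.add(h)           # otherwise treat h as a zero-frequency haplotype
        ll_full = ll_reduced   # the reduced model becomes the new full model
    n_nonzero = len(hap_freqs) - len(zeros)
    return n_nonzero - 1, zeros


# Estimated haplotype frequencies from Table 1.
estimates = {"000": 0.0, "001": 0.3242, "010": 0.0003, "011": 0.1075,
             "100": 0.1895, "101": 0.1496, "110": 1.3e-07, "111": 0.2289}


def toy_loglik(zero_set):
    # Toy stand-in for a real model fit: forcing a haplotype to zero costs
    # log-likelihood in proportion to its estimated frequency.
    return -sum(4000.0 * estimates[h] for h in zero_set)


df, zero_haps = sequential_df(estimates, toy_loglik)
```

With this toy penalty, the two smallest haplotypes (110 and 010) are set to zero, which gives 4 d.f.; a real fit could retain one of them and give 5.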

Results

Extensive Monte Carlo simulations have been performed to examine the statistical properties of the proposed mmLD mapping method. Let us consider a sample of $n$ subjects randomly chosen from a human population under HWE. For each subject, suppose its phenotypic value ($y$) is controlled by an underlying QTL, which is in linkage with a group of $k$ ($\geq 2$) markers in an LD block. The marker and QTL genotypes were simulated based on pre-specified haplotype frequencies, and the phenotypic values were generated based on QTL genotypes according to equation (2). The variances of the phenotypic values were determined by different heritability values ($H^2$) [19], which quantify the genetic contribution from the QTL to the overall variation in the trait. QTL information was removed before mmLD mapping to mimic the real scenario in which the QTL may be ungenotyped. Each simulated setting was performed 1000 times for the evaluation of the type I error and power.
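The text does not spell out how the residual variance follows from $H^2$. One common convention, assumed here rather than taken from the article, is $H^2 = V_G / (V_G + \sigma^2)$, where $V_G$ is the variance of the genotypic means under HWE. Under that assumption the residual variance can be computed as in the following sketch; the function name `residual_variance` is hypothetical, and the default means and allele frequency are the values used in the simulations described in this section.

```python
def residual_variance(h2, p_q=0.5, means=(20.0, 40.0, 60.0)):
    """Residual variance sigma^2 implied by H^2 = V_G / (V_G + sigma^2).

    This heritability definition is an assumption of this sketch.
    p_q is the frequency of allele Q; means are (mu_0, mu_1, mu_2).
    """
    if not 0.0 < h2 < 1.0:
        raise ValueError("h2 must lie strictly between 0 and 1")
    # HWE genotype probabilities for qq, Qq and QQ.
    probs = ((1.0 - p_q) ** 2, 2.0 * p_q * (1.0 - p_q), p_q ** 2)
    grand_mean = sum(p * m for p, m in zip(probs, means))
    v_g = sum(p * (m - grand_mean) ** 2 for p, m in zip(probs, means))
    return v_g * (1.0 - h2) / h2


sigma2 = residual_variance(0.05)
```

With $p_Q = 0.5$ and means of 20, 40 and 60, $V_G = 200$, so under this assumed definition $H^2 = 0.05$ would correspond to $\sigma^2 = 3800$.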

Evaluation of the type I error

We first consider simulations with three known markers and assume a sample size of 2000. The true haplotype frequencies of the three markers are given in Table 1, in which $p_{000}$ is set to zero, and the frequencies of another two haplotypes ($p_{010}$ and $p_{110}$) are set to small values, 0.0005 and 0.0003, respectively. Based on these haplotype frequencies, marker genotypes can be generated for all subjects. As the type I error is evaluated under H0, i.e. the QTL is independent of the markers, the QTL genotypes are generated independently of the known markers with $p_Q = 0.5$, assuming HWE. The phenotypic means for the three QTL genotypes (0, 1 and 2) are assumed to be $\mu_0 = 20$, $\mu_1 = 40$ and $\mu_2 = 60$, respectively. If the $i$th subject carries QTL genotype $j$, its phenotypic value is generated from a Gaussian distribution with mean $\mu_j$ and variance $\sigma^2 = 5$. The estimated type I error is calculated as the number of simulations in which H0 is rejected divided by the total number of simulations.
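The data-generating step of this null simulation can be sketched as follows. The function `simulate_null` and its output layout are illustrative assumptions; the haplotype frequencies are the true parameters in Table 1, and the remaining values ($p_Q = 0.5$, means of 20, 40 and 60, $\sigma^2 = 5$) are those stated above.

```python
import random

# True haplotype frequencies for the three known markers (Table 1).
TABLE1_TRUE = {"000": 0.0, "001": 0.3192, "010": 0.0005, "011": 0.12,
               "100": 0.18, "101": 0.15, "110": 0.0003, "111": 0.23}


def simulate_null(hap_freq, n, p_q=0.5, means=(20.0, 40.0, 60.0), var=5.0, seed=0):
    """Simulate (marker genotype, QTL genotype, phenotype) triples under H0.

    Each subject receives two marker haplotypes drawn independently (HWE),
    a QTL genotype drawn independently of the markers, and a Gaussian
    phenotype whose mean depends on the QTL genotype.
    """
    rng = random.Random(seed)
    haps = list(hap_freq)
    weights = [hap_freq[h] for h in haps]
    data = []
    for _ in range(n):
        h1, h2 = rng.choices(haps, weights=weights, k=2)
        markers = tuple(int(a) + int(b) for a, b in zip(h1, h2))
        qtl = int(rng.random() < p_q) + int(rng.random() < p_q)  # number of Q alleles
        y = rng.gauss(means[qtl], var ** 0.5)  # gauss takes a standard deviation
        data.append((markers, qtl, y))
    return data


sample = simulate_null(TABLE1_TRUE, 2000, seed=1)
```

In the analysis step the QTL genotype column would be discarded, so that only the marker genotypes and the phenotype are passed to the mapping method.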

Table 1. True parameters of the haplotype frequencies for three known markers and a representative example of the estimated haplotype frequencies from one simulation

Haplotype frequencies for known markers
            p000  p001    p010    p011    p100    p101    p110     p111
Parameters  0     0.3192  0.0005  0.12    0.18    0.15    0.0003   0.23
Estimates   0     0.3242  0.0003  0.1075  0.1895  0.1496  1.3e-07  0.2289

In Table 1, the estimates from one simulation are provided as an example. Because $\hat{p}_{000} = 0$, this haplotype clearly does not contribute to the test d.f. However, how to deal with $\hat{p}_{010} = 0.0003$ and $\hat{p}_{110} = 1.3 \times 10^{-7}$ is less obvious: if both are treated as zero frequencies, the test d.f. would be 4; if one is treated as zero and the other as non-zero, the d.f. would be 5; and if both are treated as non-zero frequencies, the d.f. would be 6. Note that if the correct test d.f. is used for calculating p-values, the p-values under H0 should theoretically follow a uniform distribution. Table 2 and Figure 1 show the type I error evaluation and the goodness of fit of the p-values to a uniform distribution (by the Kolmogorov-Smirnov test statistic) with different d.f.s. When the d.f. was fixed at 4, 5 or 6, not only were the type I errors incorrect, but the distributions of the p-values were also highly skewed, indicating that a fixed d.f. cannot correctly control the type I error. In contrast, our sequential LRT procedure yielded an appropriate type I error (0.051 at α = 0.05), and the p-values were consistent with a uniform distribution (K-S P value of 0.265). The sequential LRT adaptively selected the d.f. based on the simulated data, which varied mostly between 4 and 5 in these 1000 simulations. This clearly demonstrates the advantage of our sequential LRT procedure.

Table 2. Comparison of type I error between tests with fixed d.f. and the test with d.f. selected from the sequential LRT procedure

Method               d.f.          Type I error (α=0.05)  K-S statistic  K-S P value
LRT with fixed d.f.  4             0.063                  0.1158         <.0001
LRT with fixed d.f.  5             0.035                  0.0652         0.0004
LRT with fixed d.f.  6             0.016                  0.1926         <.0001
Sequential LRT       Data-adapted  0.051                  0.0318         0.265

The K-S statistic and P value assess the goodness of fit of the p-values to a uniform distribution.

Figure 1. Empirical distributions of the p-values calculated based on tests with fixed d.f. and the tests with d.f. selected from the sequential LRT procedure, respectively.

Power comparison between smLD, SKAT_C and mmLD

Next, we evaluated the power of the mmLD mapping method, which is the probability of correctly detecting the existence of a QTL when a QTL is indeed in LD with the tested markers. Two scenarios were considered: (1) the QTL is located among a group of $k$ adjacent markers and is not genotyped; or (2) the QTL is genotyped as one of the known markers [2]. Power was assessed separately for these two scenarios. Additionally, the mmLD was compared with other methods that can handle multiple markers, namely the adjusted single-marker LD test (smLD) and SKAT_C [10, 11, 20]. In smLD, markers are tested individually and a Bonferroni adjustment is applied to check whether any marker is significant.

Scenario 1: QTL is not genotyped. Two different settings with three or four known markers were considered here, and the marker and QTL genotypes were generated based on the true parameter values given in Tables 3 and 4. For each setting, the simulated data were generated under two cases: (1) true haplotype frequencies contain some zeros, and (2) true haplotype frequencies do not contain any zero. The phenotypic values were generated from Gaussian distributions with means $\mu_0 = 20$, $\mu_1 = 40$ or $\mu_2 = 60$, as determined by QTL genotypes, and with variances determined by the heritability value ($H^2$) [19]. The QTL information was discarded after simulation and therefore was not used in the analysis. Based on these parameters and designs, the power comparisons were conducted for different sample sizes (n = 100, 500, 1000 and 2000) and different heritability values ($H^2$ = 0, 0.05, 0.1 and 0.2). Note that when $H^2 = 0$, i.e. H0 is true, all methods can control the type I error reasonably well. Each setting was simulated 200 times for the power evaluation. As shown in Figures 2 and 3, in general, the detection power of all three methods increases as the sample size or the heritability becomes larger. In particular, the mmLD method demonstrated much higher power than the other two methods, smLD and SKAT_C. These results indicate that the mmLD performed stably relative to the other existing methods when the QTL is not genotyped.

Table 3. Simulation scenario 1 for three known markers and one QTL

Case 1
p000Q  p000q  p001Q  p001q  p010Q  p010q  p011Q  p011q
0      0.16   0.13   0.04   0      0      0.1    0
p100Q  p100q  p101Q  p101q  p110Q  p110q  p111Q  p111q
0.13   0.2    0      0      0.08   0.08   0      0.08

Case 2
p000Q  p000q  p001Q  p001q  p010Q  p010q  p011Q  p011q
0.02   0.12   0.13   0.02   0.01   0.01   0.11   0.02
p100Q  p100q  p101Q  p101q  p110Q  p110q  p111Q  p111q
0.01   0.23   0.1    0.1    0.05   0.01   0.02   0.04

Case 1: True haplotype frequencies contain some zeros. Case 2: All true haplotype frequencies are non-zero.

Table 4. Simulation scenario 1 for four known markers and one QTL

Case 1
p0000Q  p0000q  p0001Q  p0001q  p0010Q  p0010q  p0011Q  p0011q
0.025   0.12    0.03    0.015   0       0       0       0
p0100Q  p0100q  p0101Q  p0101q  p0110Q  p0110q  p0111Q  p0111q
0.05    0.1     0.09    0       0.02    0.04    0       0.01
p1000Q  p1000q  p1001Q  p1001q  p1010Q  p1010q  p1011Q  p1011q
0.11    0       0       0.02    0       0.1     0.07    0
p1100Q  p1100q  p1101Q  p1101q  p1110Q  p1110q  p1111Q  p1111q
0       0       0       0.02    0.07    0.04    0       0.07

Case 2
p0000Q  p0000q  p0001Q  p0001q  p0010Q  p0010q  p0011Q  p0011q
0.01    0.055   0.06    0.015   0.03    0.02    0.01    0.05
p0100Q  p0100q  p0101Q  p0101q  p0110Q  p0110q  p0111Q  p0111q
0.02    0.05    0.01    0.04    0.015   0.011   0.01    0.015
p1000Q  p1000q  p1001Q  p1001q  p1010Q  p1010q  p1011Q  p1011q
0.014   0.005   0.09    0.02    0.04    0.1     0.07    0.05
p1100Q  p1100q  p1101Q  p1101q  p1110Q  p1110q  p1111Q  p1111q
0.04    0.02    0.01    0.02    0.01    0.01    0.01    0.07

Case 1: True haplotype frequencies contain some zeros. Case 2: All true haplotype frequencies are non-zero.

Figure 2. Power comparison of mmLD, smLD and SKAT_C for simulation scenario 1 with three known markers and one QTL. Case 1: True haplotype frequencies contain some zeros. Case 2: All true haplotype frequencies are non-zero. The black solid line indicates the power of the mmLD, and the dashed lines indicate the powers of the smLD and SKAT_C.

Figure 3. Power comparison of mmLD, smLD and SKAT_C for simulation scenario 1 with four known markers and one QTL. Case 1: True haplotype frequencies contain some zeros. Case 2: All true haplotype frequencies are non-zero. The black solid line indicates the power of the mmLD, and the dashed lines indicate the powers of the smLD and SKAT_C.

Scenario 2: QTL is genotyped as a marker. In this scenario, one marker was assumed to be the QTL. The simulated setting is given in Table 5, in which four markers were included and the fourth marker was set to be the QTL. The single-marker method (smLD) is expected to have better power than the multi-marker methods in this case because the QTL is itself genotyped. Figure 4 displays the power comparisons of the three methods. Importantly, although the power of the mmLD is slightly lower than that of the smLD, the two remain comparable. In particular, when the sample size is reasonably large, i.e. n > 500, all methods have power of almost 1. This suggests that when a QTL has indeed been genotyped, it is fairly easy to detect by most methods. This simulation demonstrates the robustness of the mmLD.

Table 5. Simulation scenario 2: true parameters of haplotype frequencies with four markers, one of which is set to be the QTL

p000Q p000q p001Q p001q p010Q p010q p011Q p011q
0.04 0 0 0.25 0 0 0 0.12
p100Q p100q p101Q p101q p110Q p110q p111Q p111q
0.23 0 0 0 0.06 0 0 0.3

Figure 4. Power comparison of mmLD, smLD and SKAT_C for simulation scenario 2 with three known markers. The black solid line indicates the power of the mmLD, and the dashed lines indicate the powers of the smLD and SKAT_C.

Powers with different number of markers

As more markers are expected to carry more information about their linked QTL, mmLD with more markers would seemingly increase the detection power. However, it is also expected that the marginal gain from additional markers would decrease when markers are correlated. That is, most of the QTL effect that an additional marker could carry has already been reflected in the markers existing in the model. Additionally, mmLD with more markers usually requires more computation. So, a question of interest is how many markers should be included in mmLD to ensure sufficient power in QTL detection. To study this, we tracked the power change with different numbers of known markers.

In this simulation, we considered seven known markers and one QTL. The marker and QTL genotypes were generated according to the true haplotype frequencies given in Table 6. The phenotype was assumed to follow a Gaussian distribution with $\mu_0 = 20$, $\mu_1 = 40$ or $\mu_2 = 60$ and a variance corresponding to $H^2 = 0.05$. The sample size was set to 2000, and each simulation was conducted 500 times. Figure 5 shows the power change of the mmLD, SKAT_C and smLD with different numbers of markers. With three known markers applied, the power of mmLD reached 1. The powers of the other methods reached almost 1 when four or five markers were considered. Overall, the power of mmLD was consistently higher than those of SKAT_C or smLD. This suggests that in practice, we probably need to consider only four- or five-marker LD mapping.

Table 6. True parameters of the haplotype frequencies for seven known markers and one QTL

SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 QTL Simulated haplotype frequencies
0 0 0 0 0 0 0 1 0.009
0 0 0 0 1 1 1 1 0.03
0 0 0 1 1 0 1 1 0.01
0 0 0 1 1 1 1 1 0.05
0 0 1 0 1 0 0 0 0.02
0 1 0 1 1 0 1 0 0.1
0 1 0 1 1 1 1 0 0.004
0 1 0 0 1 1 1 1 0.01
0 1 0 1 1 1 1 1 0.08
0 1 1 0 0 1 1 1 0.04
0 1 1 0 1 0 0 1 0.02
1 0 0 0 0 1 0 0 0.1
1 0 0 1 0 1 0 1 0.04
1 0 1 0 0 0 0 1 0.2
1 0 1 0 1 1 0 1 0.08
1 0 1 1 0 1 1 1 0.05
1 1 0 0 0 1 1 0 0.007
1 1 1 1 0 0 0 0 0.05
1 1 1 1 1 1 1 0 0.07
1 1 1 0 1 1 0 1 0.03

Haplotypes that are not in the list have zero frequency.

Figure 5. Change of power with different numbers of known markers. The black solid line indicates the power of the mmLD, and the dashed lines indicate the powers of the smLD and SKAT_C.

Real data application

To further evaluate the applicability of the mmLD mapping with real genomic structures, the new method was applied to a public data set, ‘GAW17’, which was kindly provided by the Texas Biomedical Research Institute at the Genetic Analysis Workshop [21]. The GAW17 was a ‘mini-exome’ scan, using real sequence data for several hundred genes from the 1000 Genomes Project [21, 22]. As the original phenotypic data provided by the GAW17 were simulated based on a few rare variants (minor allele frequency, MAF < 5%), they are not readily applicable to our mmLD model. Thus, although the real genomic sequences were used here, we re-generated the phenotypic data with a specified QTL, and rare variants (MAF < 5%) were removed from this analysis, as is usually done in GWAS analyses. The mmLD mapping, smLD and SKAT_C were compared on SNP structures from three chromosomes (3, 5 and 7). The QTLs were set to be the 8th, 28th and 75th SNP, located on chromosomes 3, 5 and 7, respectively. The QTLs are indicated by the vertical lines in Figure 6. We assumed that the phenotype followed a Gaussian distribution with three means ($\mu_0 = 20$, $\mu_1 = 40$ and $\mu_2 = 60$), as determined by QTL genotypes, and a variance corresponding to $H^2 = 0.1$. The mmLD was then applied to scan the whole chromosome using a moving window of five markers. When several consecutive windows all showed significance, the most significant one was reported.

Figure 6.

The Manhattan plots for SNPs on chromosomes 3, 5 and 7 in the GAW17 data set. The x-axis indicates the SNP label, and the y-axis is the negative log10 of the p-values for the corresponding SNPs. The horizontal dashed lines indicate the significance thresholds, and the vertical lines indicate the QTL locations.

Figure 6 shows the Manhattan plots of the negative log10 p-values of SNPs for the three scans. Based on the Bonferroni correction, the overall significance cutoffs were set to 3.35, 3.39 and 3.41, respectively. For all three methods, the p-values of several SNPs near the true QTL location passed the significance threshold. The mmLD successfully identified the true locations of the QTLs. Although smLD and SKAT_C also identified the preset QTL locations, they showed significant signals at other locations as well. The mmLD showed some false signals on the 5th chromosome, but far fewer than smLD and SKAT_C. This suggests that mmLD can perform well with real genomic structures.
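As a rough illustration of how such cutoffs and the reporting rule for consecutive significant windows might be computed, consider the sketch below. The family-wise level of 0.05 and the function names are assumptions made here for illustration; the text above does not state the number of tests per chromosome.

```python
import math

def bonferroni_neglog10(alpha, n_tests):
    """Bonferroni cutoff on the -log10(p) scale for n_tests tests."""
    return -math.log10(alpha / n_tests)

def peak_per_run(neglog_p, cutoff):
    """Index of the most significant window in each run of consecutive
    windows whose -log10(p) exceeds the cutoff."""
    peaks, run = [], []
    for i, value in enumerate(neglog_p):
        if value > cutoff:
            run.append(i)
        elif run:
            peaks.append(max(run, key=lambda j: neglog_p[j]))
            run = []
    if run:  # close a run that extends to the last window
        peaks.append(max(run, key=lambda j: neglog_p[j]))
    return peaks

# Hypothetical scan: windows 1-3 form one significant run, window 5 another.
scan = [1.0, 4.0, 5.2, 3.6, 0.5, 3.9, 1.0]
cutoff = bonferroni_neglog10(0.05, 112)   # 112 tests is an assumed count
reported = peak_per_run(scan, cutoff)     # one reported window per run
```

Under the assumed level of 0.05, a cutoff of 3.35 would correspond to roughly 112 tests; that count is back-calculated here and is not reported in the text.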

Discussion

The LD mapping method with two markers (tmLD) was an improvement over single-marker-based LD mapping [2] and has been useful for QTL detection. However, the power of tmLD can be expected to improve further when more markers are incorporated into the LD mapping method. In this study, we extended the tmLD method to a more general multi-marker framework. In particular, we resolved the degrees-of-freedom (d.f.) issue of the LRT by proposing a sequential procedure, a strategy analogous to backward model selection. One important feature of the sequential procedure is that it determines the d.f. adaptively from the sampled data. That is, haplotypes with small frequencies are tested sequentially to determine whether each is significantly different from zero and, subsequently, whether it should be counted toward the d.f. The sequential procedure worked well in the simulation studies in terms of both type I error and power. The reason may be that if a haplotype is rare in the population, subjects carrying it might not be selected during sampling, in which case the effective haplotype frequency in that specific sample is truly zero. Therefore, the d.f. should vary from data set to data set.
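The sampling argument can be made concrete with a short calculation. The sketch below assumes that the 2n chromosomes in a sample of n diploid subjects are drawn independently from the population; the function names and the sample size of 100 are illustrative only.

```python
def prob_haplotype_absent(freq, n_subjects):
    """Probability that a haplotype with population frequency `freq` has no
    copy among the 2 * n_subjects sampled chromosomes (independent draws)."""
    return (1.0 - freq) ** (2 * n_subjects)

def expected_copies(freq, n_subjects):
    """Expected number of copies of the haplotype in the sample."""
    return 2 * n_subjects * freq

# A haplotype with frequency 0.004 in a hypothetical sample of 100 subjects.
p_absent = prob_haplotype_absent(0.004, 100)   # about 0.45
mean_copies = expected_copies(0.004, 100)      # 0.8 copies on average
```

In this example the haplotype would be missing from almost half of all samples of that size, so counting it toward the d.f. in every data set would overstate the number of parameters that can actually be estimated.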

In this sequential process, several tests are conducted to search for the proper d.f., so multiple testing issues may arise from the iterative testing scheme. To examine how the multiple tests affect the power of mmLD, we used the Bonferroni method to control the family-wise type I error (data not shown). Because the number of tests differs from data set to data set, we cannot preset a single number for all data. Instead, we set the number of tests to be the number of haplotypes with small frequencies, i.e. < 1/(2n), where n is the sample size. Our simulation results showed that the adjustment for multiple testing had little effect on power compared with the sequential LRT without adjustment. Thus, we expect that multiple testing should not be a major concern in practice.

In the power evaluation, mmLD showed power higher than or nearly equal to that of the adjusted single-marker association test (smLD) and SKAT_C [10, 11], in scenarios where the QTL was either genotyped or not genotyped. A real data application was also conducted to demonstrate the practical use of mmLD. We expect that mmLD can be a useful tool for future GWAS analyses, or for secondary analyses of the large number of existing GWASs.

Population structure may be of concern in a GWAS, as it can lead to spurious associations if the sample is stratified. In this case, well-known methods developed to account for population structure can be incorporated into the mmLD mapping [23]. For instance, we may first apply principal component analysis to the genotype data [24], and then include the first few principal components in Model (2) as additional covariates to account for the population structure. The computational framework for parameter estimation and hypothesis testing described here can then be applied with slight modifications.
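A minimal sketch of the principal component step is given below. It is a simplification for illustration: each SNP is centered and scaled by its sample standard deviation, which differs slightly from the allele-frequency-based normalization in [24], and the function name and the choice of two components are assumptions.

```python
import numpy as np

def top_genotype_pcs(genotypes, n_pcs=2):
    """Leading principal component scores of a genotype matrix.

    genotypes: (n_subjects, n_snps) array of allele counts (0, 1 or 2).
    Returns an (n_subjects, n_pcs) array that could serve as covariates.
    """
    G = np.asarray(genotypes, dtype=float)
    G = G - G.mean(axis=0)               # center each SNP
    sd = G.std(axis=0)
    keep = sd > 0                        # drop monomorphic SNPs
    G = G[:, keep] / sd[keep]            # scale each remaining SNP
    U, S, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]      # PC scores, largest variance first

# Illustrative use on simulated data with two subpopulations (hypothetical).
rng = np.random.default_rng(1)
freq_a = rng.uniform(0.2, 0.8, 300)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, 300), 0.05, 0.95)
geno = np.vstack([rng.binomial(2, freq_a, (100, 300)),
                  rng.binomial(2, freq_b, (100, 300))])
covariates = top_genotype_pcs(geno, n_pcs=2)
```

The returned scores would enter the trait model as extra covariate columns; how many components to keep is a judgment call that depends on the data.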

It should be noted that mmLD capitalizes on the assumption that the underlying QTL is in LD with a group of observed markers, which implicitly requires that the phenotypic traits considered in mmLD be heritable. For traits known to be heavily affected by somatic mutations, such as cancer, mmLD would not be suitable. Ideally, mmLD should be applied to GWAS of traits already shown to be heritable by other genetic evidence, such as pedigree or twin studies. Another note is that the implementation of mmLD depends on haplotype estimation from other R packages [16, 17], so its execution is slower than that of both smLD and SKAT_C. We hope to improve the computational efficiency of mmLD in the future by implementing it in a more efficient language such as C or C++. For genome-wide scanning, one strategy is to first apply single-marker analysis with a loose threshold to select several candidate regions, and then apply mmLD to those candidate regions. This two-step process may greatly alleviate the computational burden of mmLD.
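The first step of this two-step strategy might look like the sketch below. The loose threshold of 0.01, the flank of two markers on each side (chosen here to match a five-marker window) and the function name are all assumptions for illustration; the second step, running mmLD on each returned region, is not shown.

```python
def candidate_regions(p_values, loose_alpha=0.01, flank=2):
    """Select candidate regions from single-marker p-values.

    Every SNP with p < loose_alpha is padded by `flank` markers on each
    side, and overlapping or adjacent intervals are merged. Returns a list
    of inclusive (start, end) SNP index pairs.
    """
    n = len(p_values)
    regions = []
    for i, p in enumerate(p_values):
        if p >= loose_alpha:
            continue
        start, end = max(0, i - flank), min(n - 1, i + flank)
        if regions and start <= regions[-1][1] + 1:
            # Extend the previous region instead of opening a new one.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions

# Hypothetical single-marker p-values for 13 SNPs.
pvals = [0.5, 0.5, 0.5, 0.001, 0.5, 0.005, 0.5,
         0.5, 0.5, 0.5, 0.5, 0.5, 0.002]
regions = candidate_regions(pvals)   # [(1, 7), (10, 12)]
```

Only the SNPs inside the returned regions would then need the more expensive multi-marker analysis.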

Key Points

  • We developed a general framework for linkage disequilibrium mapping with multiple markers.

  • The method explicitly incorporates the linkage disequilibrium of multiple SNPs into the mixture models.

  • A novel sequential likelihood ratio test procedure was proposed to iteratively remove haplotypes with zero frequencies, which subsequently determines the proper degrees of freedom for the test.

  • The method has been applied to a real data set to demonstrate its practical applicability.

Acknowledgements

This work was supported in part by the Stony Brook Targeted Research Opportunities (to SW).

Funding

The data set used in the real data application was obtained from preparation of the Genetic Analysis Workshop 17 Simulated Exome Data Set, supported in part by NIH R01 MH059490. Sequencing data was obtained from the 1000 Genomes Project (www.1000genomes.org) through GAW grant, R01 GM031575.

Soyoun Lee recently obtained her PhD from the Department of Applied Mathematics and Statistics, Stony Brook University.

Jie Yang is an Assistant Professor in the Department of Family, Population and Preventive Medicine, Stony Brook University.

Jiayu Huang is a PhD student in the Department of Applied Mathematics and Statistics, Stony Brook University.

Hao Chen is a PhD student in the Department of Applied Mathematics and Statistics, Stony Brook University.

Wei Hou is an Assistant Professor in the Department of Family, Population and Preventive Medicine, Stony Brook University.

Song Wu is an Assistant Professor in the Department of Applied Mathematics and Statistics, Stony Brook University. His research focuses on developing novel and powerful statistical methods to uncover the genetic basis of complex traits by integrating genetic/biological principles into statistical models.

References

  • 1. Wu R, Casella G, Ma C-X. Statistical Genetics of Quantitative Traits: Linkage, Maps, and QTL. New York, NY: Springer, 2007, xvi, 365 p.
  • 2. Yang J, Zhu W, Chen J, et al. Genome-wide two-marker linkage disequilibrium mapping of quantitative trait loci. BMC Genet 2014;15:20.
  • 3. Lou XY, Casella G, Todhunter RJ, et al. A general statistical framework for unifying interval and linkage disequilibrium mapping: toward high-resolution mapping of quantitative traits. J Am Stat Assoc 2005;100:158–71.
  • 4. Weiss KM, Clark AG. Linkage disequilibrium and the mapping of complex human traits. Trends Genet 2002;18:19–24.
  • 5. Wu R, Zeng ZB. Joint linkage and linkage disequilibrium mapping in natural populations. Genetics 2001;157:899–909.
  • 6. Wang Z, Wu R. A statistical model for high-resolution mapping of quantitative trait loci determining HIV dynamics. Stat Med 2004;23:3033–51.
  • 7. Wu S, Yang J, Wu R. Mapping quantitative trait loci in a non-equilibrium population. Stat Appl Genet Mol Biol 2010;9:article32.
  • 8. Wu R, Ma CX, Casella G. Joint linkage and linkage disequilibrium mapping of quantitative trait loci in natural populations. Genetics 2002;160:779–92.
  • 9. Wu S, Yang J, Wang CG, Wu RL. A general quantitative genetic model for haplotyping a complex trait in humans. Curr Genomics 2007;8:343–50.
  • 10. Wu MC, Lee S, Cai T, et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 2011;89:82–93.
  • 11. Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics 2012;13:762–75.
  • 12. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Stat 2010;38:894–942.
  • 13. Lou XY, Casella G, Littell RC, et al. A haplotype-based algorithm for multilocus linkage disequilibrium mapping of quantitative trait loci with epistasis. Genetics 2003;163:1533–48.
  • 14. Wu S, Yang J, Wu RL. Multilocus linkage disequilibrium mapping of epistatic quantitative trait loci that regulate HIV dynamics: a simulation approach. Stat Med 2006;25:3826–49.
  • 15. Stern C. The Hardy-Weinberg Law. Science 1943;97:137–8.
  • 16. Sinnwell JP, Schaid DJ. haplo.stats: Statistical Analysis of Haplotypes with Traits and Covariates when Linkage Phase is Ambiguous. 2013. http://cran.r-project.org/web/packages/haplo.stats/index.html
  • 17. Ott J. User's Guide to the Estimating Haplotype Program. 2013. http://www.jurgott.org/linkage/eh.htm
  • 18. Clayton D. SNPHAP. 2011. https://www-gene.cimr.cam.ac.uk/staff/clayton/software/
  • 19. Kempthorne O. An Introduction to Genetic Statistics. Wiley Publications in Statistics. New York, NY: Wiley, 1957, 545 p.
  • 20. Ionita-Laza I, Lee S, Makarov V, et al. Sequence kernel association tests for the combined effect of rare and common variants. Am J Hum Genet 2013;92:841–53.
  • 21. Texas Biomedical Research Institute. GAW17 Data. 2010; GAW grant, R01 GM031575. http://www.gaworkshop.org/gaw17/data.html
  • 22. 1000 Genomes. A Deep Catalog of Human Genetic Variation. NIH R01 MH059490. http://www.1000genomes.org/
  • 23. Wu CQ, DeWan A, Hoh J, Wang Z. A comparison of association methods correcting for population stratification in case-control studies. Ann Hum Genet 2011;75:418–27.
  • 24. Price AL, Patterson NJ, Plenge RM, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 2006;38:904–9.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
