Genetics. 2006 Nov;174(3):1597–1611. doi: 10.1534/genetics.106.061275

Association Mapping of Complex Trait Loci With Context-Dependent Effects and Unknown Context Variable

Mikko J Sillanpää *,1, Madhuchhanda Bhattacharjee
PMCID: PMC1667093  PMID: 17028339

Abstract

A novel method for Bayesian analysis of genetic heterogeneity and multilocus association in random population samples is presented. The method is valid for quantitative and binary traits as well as for multiallelic markers. In the method, individuals are stochastically assigned into two etiological groups that can each have their own, possibly different, subsets of trait-associated (disease-predisposing) loci or alleles. The method is especially favorable in situations where etiological models are stratified by factors that are unknown or went unmeasured, that is, if genetic heterogeneity is due to, for example, unknown genes × environment or genes × gene interactions. Additionally, a heterogeneity structure for the phenotype does not need to follow the structure of the general population; it can have a distinct selection history. The performance of the method is illustrated with a simulated example of genes × environment interaction (quantitative trait with loosely linked markers) and compared to the results of single-group analysis in the presence of missing data. Additionally, example analyses with previously analyzed cystic fibrosis and type 2 diabetes data sets (binary traits with closely linked markers) are presented. The implementation (written in WinBUGS) is freely available for research purposes from http://www.rni.helsinki.fi/~mjs/.


WITH the wide availability of markers, association mapping has been increasingly recognized as a primary tool to identify parts of chromosomes that may show a functional relationship to the phenotype (Risch and Merikangas 1996; Flint and Mott 2001; Lohmueller et al. 2003). Population-based association studies suffer from confounding due to population stratification (inability to divide variance into within- and among-population components) and genetic heterogeneity (trait loci or their alleles are not unique for the trait). If not accounted for properly, hidden population structure (stratification) may give rise to false positives (Lander and Schork 1994; Cardon and Palmer 2003) and genetic heterogeneity can dramatically disturb or mask the mapping signals (Terwilliger and Weiss 1998; Thorton-Wells et al. 2004). This is why both confounding and heterogeneity are probable contributors to the problem of nonreplication in genetic studies of complex traits (Sillanpää and Auranen 2004).

Techniques such as stratified analysis (Clayton 2001), matching (Hinds et al. 2004), genomic controls (Devlin and Roeder 1999; Marchini et al. 2004), structured association (Pritchard et al. 2000a; Sillanpää et al. 2001; Hoggart et al. 2003), smoothing (Conti and Witte 2003; Sillanpää and Bhattacharjee 2005), use of secondary samples (Epstein et al. 2005; Kazeem and Farrell 2005), or approaches based on knowledge of relatives (Ewens and Spielman 1995; Thomson 1995; Knapp and Becker 2003) have been used to overcome the problem of population stratification. (For extensive comparison, see Setakis et al. 2006.) Similarly, there are approaches based on relationship information (linkage analysis and identity-by-descent methods), haplotype frequency profiles (Longmate 2001), or smoothing/partition/clustering of haplotypes or alleles (Thomas et al. 2001; Morris et al. 2002, 2003; Seaman et al. 2002; Molitor et al. 2003a,b; Durrant et al. 2004; Yu et al. 2004a,b) that are robust to allelic heterogeneity.

Several model-based and model-free methods consider locus heterogeneity in the context of linkage analysis or family data (Smith 1963; Leal 1997; Grigull et al. 2001; Province et al. 2001; Schaid et al. 2001; Shannon et al. 2001; Whittemore and Halpern 2001; Hodge et al. 2002; Bull et al. 2003; Ekstrom and Dalgaard 2003; Hauser et al. 2004; Hoti et al. 2004); however, few consider locus heterogeneity in association analysis or case–control data. To prevent confounding due to locus heterogeneity in association analysis, one may apply a subset analysis (Leal and Ott 2000; Rebbeck et al. 2004), which, however, requires known subsets or subsets stratified on the basis of external covariates (e.g., expression arrays or proteomics) and lacks some power. Another approach proposed by Schork et al. (2001) for the case of unknown subsets is a clustering of individuals using a set of neutral markers (or additional covariates) and incorporating this information into the subsequent association study. Province et al. (2001) have suggested use of additional covariates/markers and a recursive partitioning approach for a similar purpose. See Thorton-Wells et al. (2004) for discussion of clustering analysis, latent class analysis, and factor analysis in the context of producing homogeneous subsets of the data. To address locus/allelic heterogeneity, Sillanpää et al. (2001) suggested a joint analysis, where estimation of hidden population structure from neutral markers and detection of genotype–phenotype associations were performed simultaneously in a single modeling framework. The general problem in using population structure estimation inferred from neutral markers, known covariates (age at onset), or ethnic background to address genetic heterogeneity is that one needs to assume that the structure of genetic heterogeneity for the particular trait follows that of the general population or that given by covariates or ethnic background (Cooper et al. 2003; Foster and Sharp 2004). 
(For the opposite view, see Burchard et al. 2003.) This is especially problematic in the presence of missing data (Thorton-Wells et al. 2004). Several other assumptions are also required such as that the neutral markers have to show allele frequency difference between and Hardy–Weinberg and linkage equilibrium within the original populations (Pritchard et al. 2000b; Sillanpää et al. 2001).

One can improve the chances of finding trait loci in association analysis by using multiple gene models (Devlin et al. 2003; Kilpikari and Sillanpää 2003) and model selection (Balding et al. 2002; Broman and Speed 2002; Sillanpää and Corander 2002). To address genetic heterogeneity, we consider a multilocus association model with the joint estimation of population assignment, but unlike Sillanpää et al. (2001) we do not use any additional set of neutral markers or covariates, only the phenotypic model. Such treatment is motivated in situations when a structure of genetic heterogeneity for the particular trait does not follow that of the general population or that given by covariates or ethnic background (Cooper et al. 2003; Foster and Sharp 2004; Thorton-Wells et al. 2004). That is, the factor that would stratify the subsets is unknown or went unmeasured and it cannot be estimated on the basis of external information; only the phenotype carries some information. In unclear cases, one can compare the results in two situations: (1) by estimating the unknown stratifying factor simultaneously in the analysis and (2) by treating the estimated or self-reported ethnic background as a known stratifying factor in the analysis. Our Bayesian approach is based on locus-specific indicator variables (Uimari and Hoeschele 1997; Conti et al. 2003; Yi et al. 2003a; Yi 2004; Sillanpää and Bhattacharjee 2005), which are used to control inclusion or exclusion of the contribution from a particular locus in the multiple-regression model so that exclusion has a much higher a priori probability. The method could be seen as an extension of the earlier work (Sillanpää et al. 2001; Kilpikari and Sillanpää 2003; Sillanpää and Bhattacharjee 2005); see the discussion for differences. The proposed model is implemented using the WinBUGS software (Gilks et al. 1994; Spiegelhalter et al. 1999) and the performance of the method is illustrated in several settings for a quantitative trait with a sparse set of markers and compared with single-group analysis by using simulated data. To illustrate performance for a binary trait and closely linked markers, we analyzed real data sets of cystic fibrosis (Kerem et al. 1989) and of type 2 diabetes (Horikawa et al. 2000; Zöllner and Pritchard 2005).

MODEL

Motivations:

The model presented here is designed to be robust against genetic heterogeneity due to multiple causes: (1) genes × environment interaction, when the environmental exposure is unknown or went unmeasured; (2) genes × gene interaction, when there is no measurement from the stratifying gene; (3) rapid population expansion from a small founding population (say 50 individuals), where two founder individuals (with their own etiologies) are carrying the interesting form of the trait; (4) admixture of two populations (with their own etiologies) in the remote past so that linkage disequilibrium due to an admixture event has already vanished; and (5) structure of the population and the etiological structure of the trait have evolved separately—they have distinct selection histories. In each of these situations, one cannot utilize neutral markers or additional covariates to estimate etiological structure. Nor can one ensure that any of the above conditions are met in practice. However, in the case of no heterogeneity, this model maintains the power comparable to that of the single-group analysis (model of Sillanpää and Bhattacharjee 2005). For practical perspective on prior evidence of genetic heterogeneity, see the discussion.

Notation:

Let us consider a set of M candidate markers and a vector $(N_1, \ldots, N_M)$, where $N_l$ (≥2) is the number of alleles at locus l. These candidates may represent a preselected set of haplotype-tagging SNP markers (Meng et al. 2003; Lin and Altman 2004) or a set of chromosomal regions or haplotype blocks (International HapMap Consortium 2003, 2005), where different haplotypes (within each region/block) are treated as different alleles. The aim is to find a trait-associated subset of loci among the M candidates for some particular trait that is either quantitative (continuous) or qualitative (binary) type. Because only a discrete set of candidate loci is considered, the associated locus is likely to be just a close candidate that is in linkage disequilibrium with the true trait locus. We use the term QTL generally as a synonym for such a candidate locus linked to the quantitative or qualitative trait. We assume that phenotypes $y = (y_i)$ and marker observations $m_{\mathrm{obs}}$ have been collected in a set of M marker loci from $N_{\mathrm{ind}}$ unrelated individuals. This sample may consist of individuals from the general population or of cases and controls. In the absence of missing data, marker observations $m_{\mathrm{obs}}$ give complete genotype information $m = (m_i)$. Here, i refers to individual and $m_i = (m_{i1}, m_{i2}, \ldots, m_{iM})$, where $m_{il} = (m_{il}^{1}, m_{il}^{2})$ consists of the two alleles (assumed to be known without error) at marker locus l. Note that the alleles $m_{il}^{1}$ and $m_{il}^{2}$ are both in the range $[1, N_l]$.

Missing-data model:

We assume that there might be some missing observations among the marker genotypes and that missing marker genotypes occur at random and independently within and across markers (in the sense that the probability that the genotype is missing is not dependent on the true genotype pattern at the locus or at any of its neighboring markers). By factorizing the joint distribution of complete and observed marker data p(mobs, m) = p(mobs | m)p(m), we obtain an indicator function p(mobs | m) and the prior for complete observations p(m). Following the usual Bayesian missing-data model (Sillanpää and Bhattacharjee 2005; missing-data model 2), the prior distribution for complete genotype data p(m) under Hardy–Weinberg equilibrium is a multinomial distribution, where the occurrence probability of each allele (allele frequency) within the locus is assumed to be equal. (Note that data augmentation under this model is based on the likelihood of the data.) Given this prior for complete genotypes m, we consider only a subset of m in which p(mobs | m) = 1. This is equal to assuming that missing value imputations are made conditionally on the observations.
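As a rough illustration of the prior p(m) described above, the following Python sketch fills missing genotypes with two independent uniform allele draws per locus (equal allele frequencies under Hardy–Weinberg equilibrium) while leaving observed genotypes untouched, so that p(mobs | m) = 1. The function name and the data layout are hypothetical, and the sketch draws from the prior only; in the full analysis the augmentation is also informed by the likelihood of the data.

```python
import random

def impute_missing_genotypes(genotypes, n_alleles, rng):
    """Fill missing genotypes (None) with draws from the prior p(m).

    genotypes: list over individuals of lists over loci; each entry is a
               tuple (allele1, allele2) or None when unobserved.
    n_alleles: list with the number of alleles N_l at each locus.
    Observed entries are copied unchanged, so p(m_obs | m) = 1 holds.
    """
    completed = []
    for row in genotypes:
        new_row = []
        for l, g in enumerate(row):
            if g is None:
                # Equal allele frequencies plus Hardy-Weinberg equilibrium:
                # two independent uniform draws from the alleles 1..N_l.
                g = (rng.randint(1, n_alleles[l]), rng.randint(1, n_alleles[l]))
            new_row.append(g)
        completed.append(new_row)
    return completed

rng = random.Random(1)
observed = [[(1, 2), None], [None, (3, 3)]]
complete = impute_missing_genotypes(observed, n_alleles=[2, 4], rng=rng)
```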

Genetic model:

Let us assume two etiology groups (with possibly their own trait loci and/or associated alleles) and that each individual i has its own assignment variable with value 1 ($E_i = 1$) or 2 ($E_i = 2$) indicating assignment to one of the groups. Define an indicator variable $I_{lj}$ for group j (at marker l), where the value 1 ($I_{lj} = 1$) corresponds to the case where the marker l at group j is included in the model and value 0 ($I_{lj} = 0$) implies exclusion. To model genetic effects, we assume that alleles act additively both within locus (no dominance) and between loci (no epistasis). Each group j at each marker position l has its own vector of genetic effect coefficients $\beta_{lj} = (\beta_{lja})$, where $\beta_{lja}$ is the coefficient for allele a at marker l at group j, with $a = 1, \ldots, N_l$, $l = 1, \ldots, M$, and $j = 1, 2$. Given the assignment of individuals $E = (E_i)$ and the group-specific quantities (the vector of indicators $I = (I_{l1}, I_{l2})$, overall means $\alpha = (\alpha_1, \alpha_2)$, and effects $\beta = (\beta_{lj})$), our genetic model with additive allelic effects for observation $y_i$ (individual i) can be written as

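$$y_i = \sum_{j=1}^{2} 1_{(E_i = j)} \left[ \alpha_j + \sum_{l=1}^{M} I_{lj} \left( \beta_{lj m_{il}^{1}} + \beta_{lj m_{il}^{2}} \right) \right] + e_i \qquad (1)$$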

where the indicator $1_{(E_i = j)}$ is 1 if the assignment variable $E_i$ (of individual i) equals group j and is zero otherwise; the residuals $e_i$, regardless of the individual's etiology group, are assumed to be normally distributed N(0, 1/τ), with common precision parameter τ (i.e., the inverse of the residual variance). Binary phenotypes can also be considered by using a logit link function and omitting the residuals of the model (1) (see Uimari and Sillanpää 2001 and the Discussion in Sillanpää and Bhattacharjee 2005). We allow the first coefficient ($\beta_{lj1}$) at each locus l and each group j to be unconstrained in the model. For discussion of alternative formulations of genetic (genotype and haplotype) effects, see Sillanpää and Bhattacharjee (2005).
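To make the additive structure of the genetic model concrete, here is a minimal Python sketch of its systematic part for one individual: the group-specific overall mean plus, for every marker included in that individual's group, the effects of the two alleles carried. The function name and the dictionary-based data layout are assumptions made for illustration and are not taken from the WinBUGS implementation.

```python
def expected_phenotype(genotype, group, alpha, beta, indicators):
    """Systematic part of the genetic model for one individual (no residual).

    genotype:   list over loci of (allele1, allele2), alleles coded 1..N_l.
    group:      etiology group E_i, either 1 or 2.
    alpha:      dict {group: overall mean}.
    beta:       dict {(locus, group, allele): effect}; absent keys count as 0.
    indicators: dict {(locus, group): 0 or 1}; absent keys count as 0.
    """
    value = alpha[group]
    for l, (a1, a2) in enumerate(genotype, start=1):
        # Only loci whose indicator is 1 in this group contribute.
        if indicators.get((l, group), 0) == 1:
            value += beta.get((l, group, a1), 0.0) + beta.get((l, group, a2), 0.0)
    return value

alpha = {1: 2.3, 2: -2.3}
beta = {(1, 1, 4): -3.2}          # allele 4 at locus 1 acts in group 1 only
indicators = {(1, 1): 1, (1, 2): 0}
in_group_1 = expected_phenotype([(4, 2)], 1, alpha, beta, indicators)
in_group_2 = expected_phenotype([(4, 2)], 2, alpha, beta, indicators)
```

The same genotype yields different expected phenotypes depending on the group assignment, which is the context dependence the model is built to capture.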

Hierarchical model:

Let us have a vector of locus-specific genetic variance components $\sigma^2 = (\sigma_l^2)$ over M loci and assume a random variance model for the genetic effects $\beta \mid \sigma^2$ at each locus (for motivation, see the discussion). Let us also prespecify the prior expectation of the proportion of trait-associated markers among all candidates, denoted as s. To hierarchically model assignment variables E and adopting ignorance in specifying a uniform prior distribution (with probabilities $\tfrac{1}{2}$ and $\tfrac{1}{2}$) for the proportions of individuals in each of two groups, we assume an underlying hyperparameter $\kappa_2$ describing a proportion (relative frequency) of individuals that are members in group 2. The simple uniform distribution can be adopted if there is some prior information that sizes of the two groups are equal.

The posterior distribution $p(I, \alpha, \beta, E, \kappa_2, \tau, \sigma^2, m \mid y, m_{\mathrm{obs}}, s)$ is proportional to a joint distribution of parameters $\{I, \alpha, \beta, E, \kappa_2, \tau, \sigma^2, m\}$ and the observed data $\{y, m_{\mathrm{obs}}\}$. This relation is known as Bayes' rule and is here conditional on the fixed quantity s; see below. We make the following conditional independence assumptions: (i) given $\sigma^2$ and s, the locus indicators I and genetic effects β are independent; (ii) given $\kappa_2$, the complete marker genotypes m, the assignment variables E, and the regression parameters {α, τ} are mutually independent; and (iii) given $\sigma^2$, s, and $\kappa_2$, the locus indicators, the genetic effects, the regression parameters, the assignment variables, and the complete marker genotypes are all mutually independent. Under these conditional independence assumptions that are made a priori, one can introduce a hierarchical model as a factorization of the joint distribution of parameters and data (Figure 1). More explicitly,

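$$p(I, \alpha, \beta, E, \kappa_2, \tau, \sigma^2, m \mid y, m_{\mathrm{obs}}, s) \propto p(y \mid m, I, \alpha, \beta, E, \tau)\, p(m_{\mathrm{obs}} \mid m)\, p(m)\, p(I \mid s)\, p(\beta \mid \sigma^2)\, p(\sigma^2)\, p(E \mid \kappa_2)\, p(\kappa_2)\, p(\alpha)\, p(\tau).$$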

Here the likelihood p(y | m, I, α, β, E, τ) in the case of a quantitative trait is a normal density function and in the case of binary trait is an inverse logistic function (see Uimari and Sillanpää 2001; Sillanpää and Bhattacharjee 2005). The actual likelihood value is obtained by substituting residuals ei (= observed − estimated trait value) of genetic model (1) into the likelihood (in the case of a binary trait, the estimated trait values are substituted instead of residuals). The following priors are specified for the parameters. The marker-independence prior for indicator variables I is

$$p(I \mid s) = \prod_{l=1}^{M} \prod_{j=1}^{2} p(I_{lj} \mid s).$$

Here $p(I_{lj} \mid s)$ is a Bernoulli distribution with parameter s representing a small prior probability (expectation) for a locus to be associated with the trait. We give $s = 1/M$, which corresponds to a prior belief of one QTL among all candidates. For closely linked markers, like haplotype-tagging SNPs, one could use a marker-dependence prior as presented in Sillanpää and Bhattacharjee (2005), where the value of an indicator is dependent on the other indicators in the region. The prior distributions $p(\beta_{lja} \mid \sigma_l^2)$ for the genetic coefficients $\beta_{lja}$ (allele a) were assumed to be normal $N(0, \sigma_l^2)$ with locus-specific variance component $\sigma_l^2$, which is common for both groups (j = 1, 2). This leads to the joint prior $p(\beta \mid \sigma^2) = \prod_{l=1}^{M} \prod_{j=1}^{2} \prod_{a=1}^{N_l} p(\beta_{lja} \mid \sigma_l^2)$. The prior for genetic variance $\sigma_l^2$ at locus l was given an inverse Gamma (1, 1), and consequently $p(\sigma^2) = \prod_{l=1}^{M} p(\sigma_l^2)$. We assume that a prior for the assignment variables, $p(E_i \mid \kappa_2)$, is a Bernoulli distribution with parameter $\kappa_2 = p(E_i = 2)$ representing the probability of an individual being assigned to group 2; the prior $p(\kappa_2)$ was assumed to be Beta (1, 1). (Alternatively, one could have prespecified the fixed value $\tfrac{1}{2}$ for $\kappa_2$, representing a prior belief of both values of individual assignments $E_i$ being equally probable, implying $p(\kappa_2) = 1$.) This leads to the joint prior $p(E \mid \kappa_2) = \prod_{i=1}^{N_{\mathrm{ind}}} p(E_i \mid \kappa_2)$. The prior for the precision parameter $p(\tau)$ was Gamma (1, 1) and the prior for both overall mean parameters $p(\alpha_1) = p(\alpha_2)$ is N(0, 10), and $p(\alpha) = p(\alpha_1)p(\alpha_2)$. Note that the choice of the priors for the genetic variances, the precision parameter, and the overall means reflects the measurement scale of the trait.
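The prior specification above can be read as a generative recipe. The following Python sketch draws one set of parameter values from these priors using only the standard library. It is an illustration under stated assumptions, not the WinBUGS code: the function name is hypothetical, N(0, 10) is read here as mean 0 and variance 10 (the source does not say whether 10 is a variance or a precision), the prior inclusion probability is taken as one expected QTL among the M candidates, and inverse Gamma (1, 1) draws are obtained as reciprocals of Gamma draws with shape 1 and scale 1.

```python
import math
import random

def draw_from_priors(n_alleles, n_individuals, rng):
    """Draw one set of parameter values from the priors described above.

    n_alleles: list with the number of alleles N_l at each of the M loci.
    Assumption: N(0, 10) is interpreted as mean 0 and variance 10.
    """
    M = len(n_alleles)
    loci = range(1, M + 1)
    s = 1.0 / M  # prior expectation of one QTL among the M candidates
    # Independent Bernoulli(s) indicators for every locus and group.
    indicators = {(l, j): int(rng.random() < s) for l in loci for j in (1, 2)}
    # Inverse Gamma(1, 1): reciprocal of a Gamma(shape 1, scale 1) draw.
    sigma2 = {l: 1.0 / rng.gammavariate(1.0, 1.0) for l in loci}
    # Allelic effects share the locus-specific variance across both groups.
    beta = {
        (l, j, a): rng.gauss(0.0, math.sqrt(sigma2[l]))
        for l in loci for j in (1, 2) for a in range(1, n_alleles[l - 1] + 1)
    }
    kappa2 = rng.betavariate(1.0, 1.0)  # Beta(1, 1) is uniform on (0, 1)
    E = [2 if rng.random() < kappa2 else 1 for _ in range(n_individuals)]
    tau = rng.gammavariate(1.0, 1.0)    # residual precision
    alpha = {j: rng.gauss(0.0, math.sqrt(10.0)) for j in (1, 2)}
    return {"s": s, "indicators": indicators, "sigma2": sigma2, "beta": beta,
            "kappa2": kappa2, "E": E, "tau": tau, "alpha": alpha}

draw = draw_from_priors(n_alleles=[2, 3, 4], n_individuals=5, rng=random.Random(42))
```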

Figure 1.—


Directed acyclic graph (DAG) graphically summarizing hierarchical structure of the model. Note that the conditional independence assumptions are visible in this graph. Boxes refer to prespecified values or observed data and ellipses to random variables, whose unknown values are estimated in the analysis. Directions of hierarchical (solid arrows) and deterministic (dashed arrows) relations are shown. In the case of binary phenotype, the precision parameter (τ) is omitted from the model.

SIMULATED DATA ANALYSIS

In data generation, we wanted to mimic a situation, where there are two etiological groups that arise because of presence or absence of exposure to some factor (modifier) that is completely unknown or went unmeasured. In such a situation, two groups may have identical genotype distributions (and homogeneous ethnic background) and one cannot utilize neutral markers or self-reported ethnic background to estimate heterogeneity underlying the phenotype. These kinds of data may arise when there is gene–environment interaction. We first describe how the homogeneous population of 1000 sampled individuals in the current generation was generated. Then we explain how we created (sampled) two subgroups from this homogeneous population so that there are between-group differences in trait etiology but no differences in genotype frequencies.

Generation of homogeneous population:

We first adopted a simulated marker data set used in Kilpikari and Sillanpää (2003). The data set consisted of 1000 individuals with a complete set of genotypes at 36 multiallelic markers, with recombination fractions in between 0.01 and 0.5. There were varying numbers (two to seven) of segregating alleles at each locus with average heterozygosity of 0.65. These 1000 individuals, which were generated using a backward population simulator (Gasbarra et al. 2005), had a common founding population (466 founders) 10 generations ago. The following assumptions were used: Hardy–Weinberg and linkage equilibrium for founders and nonrandom mating and slowly increasing population size in each (discrete) generation. For more details of the original data set, see Kilpikari and Sillanpää (2003) and for the simulator see Gasbarra et al. (2005).

Simulating etiological subgroups:

We created two etiology groups so that each individual in the homogeneous data set (1000 individuals in total) had a ∼25% chance of being randomly assigned to each of the two groups. This sampling process resulted in 244 individuals in group 1 and 220 individuals in group 2. Such a “drop” in the number of study individuals (from 1000 to 464) was partly motivated by the reduced computation time. The quantitative phenotypes in both groups were generated analogously using an additive generating model: phenotype = overall mean + a sum of additive genetic values of the trait loci + residual sampled from a standard normal distribution, N(0, 1). The two groups were simulated to have their own values for overall mean, trait loci (exactly at markers), and their genetic effects (see Table 1). Note that only one of the three simulated trait loci was active in both groups and even there the active alleles were different (allelic heterogeneity). The recombination fractions surrounding the simulated QTL were the following: 0.5 and 0.5 for QTL at marker 4; 0.02 and 0.2 for QTL at marker 16; and 0.1 and 0.1 for QTL at marker 30. Individual observations of the two groups were then combined (464 individuals altogether) and the group status of each individual was “forgotten” in the data analysis stage. In this combined data set, the heritability attributable to each QTL at markers 4, 16, and 30 was 0.37, 0.12, and 0.05, respectively. This corresponds to the overall heritability of 0.55. Even if heritability in this data set may appear to be unrealistically high, the analysis presented here arguably corresponds to the analysis of a larger sample and smaller heritability. The simulated phenotypic distributions of the two groups are shown in Figure 2. Note that these two distributions overlap reasonably well and that the same range of values is covered in both distributions. We refer to this combined data set later in the article as the complete data set.

TABLE 1.

The simulated values of overall mean, trait-associated markers, and additive values of their influential alleles in the two etiology groups

| Group | Overall mean | Marker | Allele | Additive value |
|---|---|---|---|---|
| 1 | 2.3 | 4 | 4 | −3.2 |
| 1 | 2.3 | 16 | 1 | 2.0 |
| 2 | −2.3 | 16 | 3 | 1.2 |
| 2 | −2.3 | 30 | 5 | 1.2 |
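The additive generating model and the Table 1 values can be sketched in Python as follows. This is an illustration rather than the simulation code used for the study: the function names and the genotype layout are hypothetical, and it is assumed that each copy of an influential allele contributes its additive value once (additivity within a locus).

```python
import random

# Table 1: group -> (overall mean, {(marker, allele): additive value})
TABLE1 = {
    1: (2.3, {(4, 4): -3.2, (16, 1): 2.0}),
    2: (-2.3, {(16, 3): 1.2, (30, 5): 1.2}),
}

def noise_free_value(genotype, group):
    """Overall mean plus the summed additive genetic values of the trait loci.

    genotype: dict {marker: (allele1, allele2)} covering the trait loci.
    """
    mean, effects = TABLE1[group]
    genetic = 0.0
    for (marker, allele), value in effects.items():
        # Assumption: every copy of the influential allele adds its value.
        genetic += value * genotype[marker].count(allele)
    return mean + genetic

def simulate_phenotype(genotype, group, rng):
    """Noise-free value plus a standard normal residual, N(0, 1)."""
    return noise_free_value(genotype, group) + rng.gauss(0.0, 1.0)

genotype = {4: (4, 4), 16: (1, 2), 30: (5, 5)}
value_group_1 = noise_free_value(genotype, 1)
value_group_2 = noise_free_value(genotype, 2)
y = simulate_phenotype(genotype, 1, random.Random(7))
```

For this hypothetical genotype the same marker data imply quite different expected phenotypes in the two groups, even though the genotype distribution itself carries no information about group membership.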

Figure 2.—


The phenotypic distributions of the two groups drawn as frequency polygons (solid line, group 1; dotted line, group 2). The phenotypic “classes” (the central points indicated with the circles) are shown on the x-axis and corresponding frequencies on the y-axis.

Analyses:

We sequentially increased the amount of random missingness in the genotype data. We considered the complete data set and four other data sets, where 5, 10, 15, or 20% of the marker genotypes of the original complete set were coded as missing. The missingness was introduced in a nested manner so that all genotypes that were missing in the 5% data set were also missing in the 10, 15, and 20% data sets and so on. All five marker data sets (with identical phenotypes) were analyzed with the proposed method. For comparison, a single group analysis was also performed (which is equivalent to constraining all individuals to belong to one group only) for the data set with 5% missing values. Note that the single-group analysis closely corresponds to the methods of Kilpikari and Sillanpää (2003) and Sillanpää and Bhattacharjee (2005) (with marker-independence prior), whose performances are roughly comparable to the frequentist multiple regression approaches (see the above articles for details).
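One simple way to obtain nested missingness of this kind is to shuffle all genotype cells once and mask successively longer prefixes of the shuffled list. The Python sketch below illustrates that idea; it is an assumed procedure with a hypothetical function name, since the text does not describe how the nesting was implemented.

```python
import random

def nested_missing_masks(n_individuals, n_markers, fractions, rng):
    """Return {fraction: set of (individual, marker) cells coded as missing}.

    All cells are shuffled once; each fraction masks a prefix of that single
    ordering, so every smaller mask is contained in every larger one.
    """
    cells = [(i, l) for i in range(n_individuals) for l in range(n_markers)]
    rng.shuffle(cells)
    masks = {}
    for f in sorted(fractions):
        k = round(f * len(cells))
        masks[f] = set(cells[:k])
    return masks

# 464 individuals and 36 markers, as in the simulated data set.
masks = nested_missing_masks(464, 36, [0.05, 0.10, 0.15, 0.20], random.Random(0))
```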

The estimation of the model parameters was performed in WinBUGS 1.3 (Gilks et al. 1994; Spiegelhalter et al. 1999) using a Pentium 4, 2.8 GHz. We used random initial values in the analyses. In all analyses, we ran two Markov chain Monte Carlo (MCMC) chains of length 25,000. The first 5000 “burn-in” iterations were discarded from each chain, which resulted in 40,000 pooled MCMC samples in total. (In analysis of the data set with 10% missing values we utilized only samples from a single chain due to reasons explained in results.) We stored all the MCMC samples (“no thinning”), because of a sufficient storage capacity and a low autocorrelation in the samples. Two MCMC chains were run in parallel, which took ∼40 hr for the complete data and up to 60 hr for the 20% missing data. However, the same number of iterations for a single-group analysis took only ∼8 hr (with 5% missing data). The convergence assessment was performed by visually monitoring chains for several different parameters.

SIMULATION RESULTS

In Table 2, one can see the estimated posterior probabilities for different markers to be associated with the phenotype (i.e., QTL probabilities) at a group level and the posterior expectation (over markers) for the number of QTL in given groups. Table 2 illustrates the effect of cumulative missingness on these probabilities. In the complete data and the data set with 5% missing values, only the correct QTL positions for the two groups (markers 4, 16, and 30) are supported with QTL probability 1 and there is very little support for other positions. For comparison, the single-group analysis in the data set with 5% missing values (not shown in Table 2) resulted in nonzero QTL probabilities for markers 4, 16, 20, and 30 with values 1.0, 0.4, 0.1, and 0.5, respectively. With increasing missingness in Table 2, spurious QTL positions (markers 13 and 29) also gathered more support (with QTL probability ≤0.6). The analysis of the data set with 10% missing values showed us that we cannot clearly identify the two different genetic mechanisms in this case (if we utilize samples from both chains).

TABLE 2.

Group-specific QTL probabilities

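| Marker | Group 1, 0% | Group 1, 5% | Group 1, 10% | Group 1, 15% | Group 1, 20% | Group 2, 0% | Group 2, 5% | Group 2, 10% | Group 2, 15% | Group 2, 20% |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1.0 | 1.0 | 0.5 | 1.0 | 1.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 |
| 13 | 0.0 | 0.1 | 0.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 |
| 16 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 29 | 0.0 | 0.0 | 0.0 | 0.0 | 0.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 30 | 0.0 | 0.0 | 0.5 | 0.0 | 0.1 | 1.0 | 1.0 | 0.5 | 1.0 | 1.0 |
| E(NQTL \| data) | 2.07 | 2.17 | 2.36 | 2.56 | 3.16 | 2.08 | 2.06 | 2.03 | 2.05 | 2.03 |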

Different percentages (0, 5, 10, 15, and 20%) indicate the different data sets with corresponding percentage of missing values. For all data sets, the estimates are presented on the basis of the group labels giving the best fit. Only the markers having at least one QTL probability ≥0.1 are shown. At the bottom, the posterior expected number of QTL for a given group, denoted by E(NQTL | data), is shown.
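Quantities like those in Table 2 can be obtained from stored MCMC output by simple averaging. The Python sketch below assumes, hypothetically, that the sampled locus indicators of one group are available as a list of 0/1 vectors (one vector per MCMC draw); the QTL probability of a marker is then the fraction of draws in which its indicator equals 1, and the posterior expected number of QTL is the sum of these probabilities over markers.

```python
def qtl_summary(indicator_samples):
    """Summarize sampled locus indicators for one etiology group.

    indicator_samples: list of MCMC draws, each a list of 0/1 indicators
                       over the M markers.
    Returns (per-marker QTL probabilities, expected number of QTL).
    """
    n = len(indicator_samples)
    n_markers = len(indicator_samples[0])
    probs = [sum(draw[l] for draw in indicator_samples) / n
             for l in range(n_markers)]
    return probs, sum(probs)

# Toy example with four draws over three markers (not data from the study).
draws = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [1, 0, 0]]
probs, expected_n_qtl = qtl_summary(draws)
```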

Labeling problem:

A well-known problem in this kind of population assignment method is the weak identifiability of the labels of the two groups (see Pritchard et al. 2000b; Stephens 2000). This means that sometimes the group configurations obtained in one analysis end up as their mirror image in another analysis just because of the symmetry in the likelihood of the parameters (Figure 3). Following the standard practice of sophisticated population assignment methods in genetics, like those of Pritchard et al. (2000b), there is nothing in the model, except the initial values of the MCMC sampler, to attach individuals of group one to the group with index one. The suggested solution for this problem is to impose some order constraints on one of the parameters such as constraining the groups to have increasing means or increasing proportions of individuals in the groups (Richardson and Green 1997; Stephens 2000). However, R. M. Neal's comment in Kass et al. (1998) was strongly against using such constraints, because they can do serious harm to the convergence. Here such constraints presumably do not provide good identifiability for the groups and complicate interpretation, because the information with respect to the groups at some loci seems to be too weak on the basis of our experiences with the secondary analysis (see conditional analyses below). In unconstrained models such as the one here, it can actually be preferable that the MCMC sampler is mixing poorly with respect to symmetry of the two groups (i.e., label switching does not occur), because it simplifies interpretation of the results (see Pritchard et al. 2000b). Note that Celeux et al. (2000) have suggested a tempering scheme for assignment models to improve the mixing properties of the sampler. As is typical in population assignment methods (see Pritchard et al. 2000b), the label switching did not occur within any single MCMC chain in our analyses. 
Such a behavior was found here only between two separate MCMC chains with the data set with 10% missing values.
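When label switching occurs only between chains, as here, one pragmatic check before pooling is to compare the group-specific QTL probabilities of a second chain with those of a reference chain under both labelings and keep the closer one. The Python sketch below illustrates this idea; it is a hypothetical helper, not the procedure used in this study, where results are reported for the group labels giving the best fit.

```python
def align_labels(reference, other):
    """Relabel the groups of `other` if its mirror image is closer to `reference`.

    reference, other: dict {1: [QTL probabilities], 2: [QTL probabilities]},
                      one list entry per marker.
    Returns (aligned summary, True if the labels were swapped).
    """
    def distance(a, b):
        # Sum of absolute differences over both groups and all markers.
        return sum(abs(x - y) for g in (1, 2) for x, y in zip(a[g], b[g]))

    swapped = {1: other[2], 2: other[1]}
    if distance(reference, swapped) < distance(reference, other):
        return swapped, True
    return other, False

# Toy QTL probabilities at three markers; chain_b mirrors chain_a.
chain_a = {1: [1.0, 1.0, 0.0], 2: [0.0, 1.0, 1.0]}
chain_b = {1: [0.0, 1.0, 1.0], 2: [1.0, 1.0, 0.0]}
aligned_b, was_swapped = align_labels(chain_a, chain_b)
```

A check of this kind helps only when the two configurations are clearly separated; it does not resolve the case where the information about the groups at some loci is weak.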

Figure 3.—


Illustration of how markers selected to be active (thick ticks on the x-axis) in the two etiology groups (1 and 2) can lead to different posteriors in three hypothetical MCMC analyses (A–C). In the first analysis (A), markers 2, 9, and 21 are active in group 1 and markers 4 and 16 are active in group 2. In the second analysis (B), the opposite group configuration is supported. In the third analysis (C), the MCMC sampler has ended up switching between configurations of A and B.

Conditional analyses:

With real data, one may not be able to decide if the estimated genetic architecture is unclear (not unique) because of the label switching or because of the complexity of the underlying genetic architecture. Even though it was known in our simulation study that the label switching was responsible, we still performed the additional analyses with the data set having 10% missing values to illustrate one possible approach. Three additional analyses were performed so that a locus-indicator pair (corresponding to two groups) at a single locus of the three markers (4, 16, and 30) was kept fixed throughout the MCMC estimation process. These three conditional analyses are “short cuts” for exhaustive enumeration of analyses where all possible combinations of the gene actions over these three loci are fixed one at a time. However, we can still gain some further information about the genetic architecture this way. Two parallel chains were run with 10,000 iterations in each (the first 2500 iterations from both were discarded, as burn in). This resulted in 15,000 effective MCMC samples in total.

In the first conditional analysis, where the two group-specific locus indicators of marker 4 were simultaneously fixed to values of $I_{4,1} = 1$ and $I_{4,2} = 0$, the analysis was able to reconstruct the true underlying genetic architecture extremely well, having well-structured QTL probabilities of the two groups at markers 16 (1.0, 1.0) and 30 (0.1, 1.0) and a negligible support for marker 13 (0.1, 0.0), respectively. In the second analysis, where the two indicators of marker 16 were fixed to values of $I_{16,1} = 1$ and $I_{16,2} = 0$, several nonzero QTL probabilities appeared at new (incorrect) marker positions, in particular for group 2, indicating that the locus indicator $I_{16,2}$ should not be zero. This was further supported by the QTL probabilities in group 1 for markers 4 and 30, which were 0.5 and 0.5, respectively. In the third analysis, where the two indicators of marker 30 were fixed to values of $I_{30,1} = 1$ and $I_{30,2} = 0$, the problem of label switching occurred again even after the constraint so that the two MCMC chains were mirror images of each other (excluding the fixed locus 30). This behavior might be an indication of the weak genetic effect of marker 30 (cf. the simulated effect sizes of Table 1). However, one of the two MCMC chains converged to the same structure supported by the first conditional analysis (of marker 4), which also happened to be the true underlying structure.

Summarizing QTL effects:

Because a random variance model was used for QTL effects, it is natural to look first at the posterior estimates of genetic variances and then at the sizes of the individual effects. As expected, the estimated genetic variances (their posterior means) at positions with low QTL probabilities are very close to their prior mean 1 (the results not shown). This is because in our model the data do not influence the genetic variance estimate in the MCMC rounds where the corresponding locus indicator value is zero. Due to (i) assuming a priori independence of the indicators and genetic effects (cf. Kuo and Mallick 1998), (ii) assuming only a single genetic variance parameter for each marker, and (iii) assuming a nonzero prior mean for the genetic variance, we summarize our results in two groups in the form of weighted genetic variances (Inline graphic) in Figure 4. The weighted genetic variance is actually the model-averaged estimate of the genetic variance (averaged over all models with the effect set to zero in models where the marker for a given group was not selected) (Sillanpää and Bhattacharjee 2005). See the discussion for the motivation for points i and ii above. (Because of point iii above, we expected to obtain better mixing properties for the sampler and avoid confounding between locus indicators and genetic effects, leading to better estimates of QTL probabilities.) The estimates of Figure 4 are based on the additional analysis with two parallel MCMC chains (both of length 1500 and no burn in), where the last states of parameter values (of earlier 25,000 MCMC rounds) were used as starting values. (In practice this can be seen also as discarding 25,000 MCMC samples as burn in from each chain.) For a comparison, the weighted genetic variances from the single-group analysis (in the data set with 5% missing values) are also shown (Figure 4A).
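The model-averaging step described above can be sketched in a few lines. This is only an illustration in Python with NumPy, not the authors' WinBUGS code: the function name `weighted_genetic_variance` and the arrays `indicators` and `variances` are hypothetical, and the sketch assumes that the MCMC output for one group is stored as (samples × markers) arrays. It averages indicator × variance over the samples, so that rounds in which a marker was not selected contribute zero.

```python
import numpy as np

def weighted_genetic_variance(indicators, variances):
    """Model-averaged genetic variance per marker for one group.

    indicators: (n_samples, n_markers) array of sampled 0/1 locus indicators.
    variances:  (n_samples, n_markers) array of sampled genetic variances.
    Samples with indicator 0 contribute a variance of zero to the average.
    """
    indicators = np.asarray(indicators, dtype=float)
    variances = np.asarray(variances, dtype=float)
    if indicators.shape != variances.shape:
        raise ValueError("indicators and variances must have the same shape")
    return (indicators * variances).mean(axis=0)

# Toy MCMC output (made-up numbers): 4 samples, 2 markers.
indicators = np.array([[1, 0], [1, 0], [0, 0], [1, 1]])
variances = np.array([[2.0, 1.0], [4.0, 1.0], [1.0, 1.0], [3.0, 1.0]])

wgv = weighted_genetic_variance(indicators, variances)
qtl_prob = indicators.mean(axis=0)  # posterior QTL probability per marker
```

In this toy example the second marker keeps a sampled variance near its prior mean of 1 in every round, yet its weighted genetic variance is small because the marker is rarely selected.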

Figure 4.—

The weighted genetic variances. Locus-specific point estimates (posterior mean) of weighted genetic variances in two groups for complete data (B), a data set with 5% missing values (C), a data set with 10% missing values (D), a data set with 15% missing values (E), a data set with 20% missing values (F), and a single-group analysis (A) for a comparison are shown. These quantities are calculated on the basis of pooling samples from two separate MCMC chains with 1500 samples (after 25,000 burn in) for each. Marker numbers are shown on the x-axis and weighted genetic variances are on the y-axis. Data sets are labeled using groups that give the best fit. Note that only a single MCMC chain was utilized in D. ▪, group 1; □, group 2.

Dissecting allelic heterogeneity:

The posterior estimates of locus-specific allelic effects at two groups for complete data are shown in Figure 5. In the top, one can see that only three simulated QTL seem to have nonnegligible estimated allelic effects, which are shown in more detail below (Figure 5, bottom). Since we did not impose any constraints for the coefficients, one should interpret the graphs with respect to that. (One can impose constraints afterward by setting the first coefficient to zero and looking at differences (contrasts) of estimated coefficients at each locus.) If the constraints are imposed afterward, then only four true alleles seem to have nonnegligible effects (allele 4 at marker 4, alleles 1 and 3 at marker 16, and allele 5 at marker 30), which closely correspond to the simulated values of Table 1. In Figure 5, bottom, we observe that marker 16 is the only one showing evidence of allelic heterogeneity and the evidence is with respect to the correct alleles (1 and 3).
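The post hoc constraint mentioned above (setting the first coefficient to zero and reading the remaining coefficients as contrasts) amounts to a simple subtraction. The following Python sketch uses made-up coefficient values, and the function name `contrasts_to_first_allele` is hypothetical.

```python
import numpy as np

def contrasts_to_first_allele(coefs):
    """Express allelic effects at one locus relative to the first allele."""
    coefs = np.asarray(coefs, dtype=float)
    return coefs - coefs[0]

# Made-up posterior means of unconstrained allelic effects at one locus.
raw_effects = np.array([0.4, 0.4, 1.9, 0.4])
contrasts = contrasts_to_first_allele(raw_effects)
```

Here only the third allele differs from the first after the constraint, although all four unconstrained coefficients are nonzero.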

Figure 5.—

QTL allelic effects. Locus-specific point estimates (posterior mean) of allelic effects in two groups for complete data. The effects are shown for all the markers (top) and for the markers having nonnegligible effects on the phenotype (bottom). The marker and allele numbers are shown on the x-axis and allelic effects on the y-axis. In the top, the allelic effects are presented as a curve (frequency polygon) in two groups (solid line, group 1; shaded line, group 2). Vertical lines in the bottom indicate change of marker. ▪, group 1; □, group 2.

Concordance in assignments:

To illustrate how well the individual assignments can be identified from five different data sets, Table 3 presents the numbers of correctly and incorrectly classified individuals in each group. The estimated group for each individual is based on the posterior mean estimate of the group assignment. The posterior mean estimated proportion of individuals belonging to the smaller of the two groups is also shown (true value is 0.47). Note that the proportion of incorrectly classified individuals is slowly increasing with the amount of missing data, but the “prop of inds”, which equals E(p2 | y, mobs, s), does not seem to do so. We would have expected to see this behavior in the case that the prior value for the proportion of individuals belonging to one of the two groups was 0.5, corresponding to a uniform prior on the group assignment variable (Ei).

TABLE 3.

Cross-tabulations of posterior mean classifications with respect to their original groups

           (Original group, estimated group)
Data (%)   (1, 1)   (1, 2)   (2, 1)   (2, 2)   Prop of inds
0          202      42       51       169      0.45
5          198      46       50       171      0.46
10         195      49       49       171      0.47
15         193      51       54       166      0.47
20         188      56       56       164      0.47

Different percentages (0, 5, 10, 15, and 20%) indicate the different data sets with corresponding percentage of missing values. “Prop of inds” indicates the posterior mean estimated proportion of individuals that are members in the smaller of the two groups (which happened to be group 2 here) estimated from sampled values for hyperparameter κ2. For all data sets, the estimates are presented on the basis of the group labels giving the best fit. Note that only a single MCMC chain was utilized in the 10% data set.
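The trend in misclassification noted in the text can be checked directly from the counts in Table 3. The short Python sketch below only re-expresses the tabulated numbers; the names `table3` and `misclassified_fraction` are hypothetical, and the total is computed separately for each row rather than assumed.

```python
# Counts copied from Table 3, keyed by percentage of missing values:
# (orig 1, est 1), (orig 1, est 2), (orig 2, est 1), (orig 2, est 2)
table3 = {
    0: (202, 42, 51, 169),
    5: (198, 46, 50, 171),
    10: (195, 49, 49, 171),
    15: (193, 51, 54, 166),
    20: (188, 56, 56, 164),
}

def misclassified_fraction(n11, n12, n21, n22):
    """Share of individuals whose estimated group differs from the original."""
    total = n11 + n12 + n21 + n22
    return (n12 + n21) / total

rates = {pct: misclassified_fraction(*counts) for pct, counts in table3.items()}

for pct, rate in rates.items():
    print(f"{pct:>2}% missing: {rate:.3f} misclassified")
```

The off-diagonal share rises from about 0.20 with complete data to about 0.24 with 20% missing values.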

Estimated overall means:

Table 4 shows how well the group-specific overall mean parameters (true values are 2.3 and −2.3) were estimated from five different data sets using the proposed approach and the single-group analysis (with the constraint that all individuals belong to group 1). Although only the posterior means are shown, they can still illustrate how estimates have been shrunk toward zero by increasing the amount of missing values or assuming only a single group. Note that the single-group analysis was performed only for the data set with 5% missing observations.

TABLE 4.

The posterior mean estimates of overall mean in two-group and single-group analyses

           Group
Data (%)   1       2        Single-group analysis
0          1.66    −1.23
5          1.95    −1.16    0.43
10         1.40    −1.07
15         1.53    −0.86
20         1.39    −0.82

Different percentages (0, 5, 10, 15, and 20%) indicate different data sets with corresponding percentage of missing values. For all data sets, the estimates are presented on the basis of the group labels giving the best fit. Note that only a single MCMC chain was utilized in the 10% data set.

REAL DATA ANALYSIS

Cystic fibrosis data, model, and analysis:

As in Sillanpää and Bhattacharjee (2005), we selected a well-known cystic fibrosis (CF) data set (Kerem et al. 1989), with binary phenotype and 23 biallelic markers collected from 93 individuals. The markers in this data set span the 1.7-Mb region surrounding the cystic fibrosis transmembrane conductance regulator (CFTR) gene on chromosomal segment 7q31. The data set contains the haplotype information (without individual identities) and the physical distances (Kerem et al. 1989; Morris et al. 2000), and some alleles are missing. As earlier (Sillanpää and Bhattacharjee 2005), we use a marker-dependence prior (with Inline graphic) to utilize physical distances and the model where each individual contributed two independent observations (phenotype and one allele in each locus) to the analysis. The prior for the overall smoothing parameter is Gamma(1, 0.01) with prior mean at 100. A difference from our earlier analysis (Sillanpää and Bhattacharjee 2005) is that here the haplotype information is utilized: each haplotype is classified into one of the two etiology groups. By using such an independent-observation idea (Sasieni 1997), sample size is doubled, only a single allelic coefficient is fitted in each selected locus for each observation, and estimated effects are approximately double in size. Note that here we are effectively carrying out a heterogeneity analysis for the alleles with respect to their parental origins. Note also that, unlike Sillanpää and Bhattacharjee (2005), we use a random variance model for genetic effects and the prior for missing values where each allele is considered to be a priori equally probable at each marker locus. Because of the binary phenotype, we do not have a prior for the precision parameter. Otherwise, we use the same priors as in the simulated data analyses.
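The data layout implied by the independent-observation idea can be sketched as follows. This Python fragment is an assumption-laden illustration, not the authors' data preparation: the function name `expand_to_allele_observations` and the toy input are hypothetical, and it assumes that each individual is stored as a phenotype plus a pair of haplotypes (tuples of alleles).

```python
def expand_to_allele_observations(phenotypes, haplotype_pairs):
    """Turn n individuals into 2n observations of (phenotype, haplotype).

    Each individual contributes its phenotype twice, once with each of its
    two haplotypes, so every observation carries one allele per locus.
    """
    observations = []
    for phenotype, (hap_a, hap_b) in zip(phenotypes, haplotype_pairs):
        observations.append((phenotype, hap_a))
        observations.append((phenotype, hap_b))
    return observations

# Toy input (made up): 3 individuals, 2 biallelic markers.
phenotypes = [1, 0, 1]
haplotype_pairs = [((1, 2), (2, 2)), ((1, 1), (1, 2)), ((2, 1), (2, 2))]

observations = expand_to_allele_observations(phenotypes, haplotype_pairs)
```

With 93 individuals, the same expansion gives the 186 haplotype-level observations analyzed here.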

The parameter estimation was done in WinBUGS 1.3 (Gilks et al. 1994; Spiegelhalter et al. 1999), using a Pentium 4, 3.40 GHz. Two parallel MCMC chains, each of length 7800, were run with random initial values, which took ∼29 hr. With 300 burn-in iterations per chain and no thinning, this resulted in 15,000 pooled MCMC samples in total. No evidence of label switching or convergence problems was detected by our visual inspection of MCMC chains for several different parameters.

Results of CF data:

In Figure 6, we present values of weighted genetic variances estimated for two etiology groups. Only locus 2 shows a highly elevated value in group 1 whereas we can see two elevated peaks at positions 10 and 17 in group 2. The estimated proportions of haplotypes classified into two groups, respectively, were ∼0.2 and 0.8, and individual assignment probabilities were found to be surprisingly high, making unambiguous membership identification possible for most of the haplotypes. The two positions (10 and 17) as well as their weighted genetic variance peaks in group 2 are very similar to what was found in the single-group analysis of Sillanpää and Bhattacharjee (2005). (Note that only QTL probabilities were shown in Sillanpää and Bhattacharjee 2005). However, one can see how the peak of position 2 in group 1 has grown much higher in the two-group analysis, changing the overall conclusion for that position. Note that Lazzeroni (1998) also found a notably high peak at position 2.

Figure 6.—

Locus-specific point estimates (posterior mean) of weighted genetic variances in two groups for the CF data. Marker numbers are shown on the x-axis and weighted genetic variances (expressed in logit scale) on the y-axis. ▪, group 1; □, group 2.

Further exploration of the haplotype data was performed for a subset of haplotypes that showed very high assignment probability (>0.75) to one of the two groups. This led us to 169 classified haplotypes of a total of 186. This exercise revealed that although locus 2 was identified as an influential position in group 1, allele frequencies at this marker were not visibly different for two groups when compared to frequencies calculated from the whole data set. Surprisingly, it was locus 17 that showed a remarkable allele frequency difference between the two etiology groups (suggesting that position 17 could actually be a stratifying factor). Moreover, a large proportion of haplotypes in group 1 were observed to have allele 2 at locus 17 co-occurring with allele 1 at locus 2 (suggesting the existence of epistatic interaction between loci 17 and 2). At locus 10, we observed a minor increase of frequency of allele 2 in group 1 and of allele 1 in group 2 but the difference was much milder than that at locus 17.
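The exploratory step described above (keeping only confidently assigned haplotypes and comparing allele frequencies between the two groups) can be sketched as below. This is a hedged illustration in Python with NumPy: the function name `allele_freqs_by_group`, the toy data, and the assumption of no missing alleles at the inspected marker are all hypothetical, and only the 0.75 threshold is taken from the text.

```python
import numpy as np

def allele_freqs_by_group(alleles, prob_group1, threshold=0.75):
    """Allele frequencies at one marker among confidently assigned haplotypes.

    alleles:     allele carried by each haplotype at the marker of interest.
    prob_group1: posterior probability that each haplotype belongs to group 1.
    Haplotypes whose assignment probability to either group does not exceed
    the threshold are left out.
    """
    alleles = np.asarray(alleles)
    prob_group1 = np.asarray(prob_group1, dtype=float)
    in_group1 = prob_group1 > threshold
    in_group2 = (1.0 - prob_group1) > threshold

    def freqs(mask):
        values, counts = np.unique(alleles[mask], return_counts=True)
        return {int(v): float(c) / counts.sum() for v, c in zip(values, counts)}

    n_classified = int(in_group1.sum() + in_group2.sum())
    return freqs(in_group1), freqs(in_group2), n_classified

# Toy data (made up): six haplotypes at a biallelic marker.
alleles = [1, 2, 2, 1, 2, 1]
prob_group1 = [0.90, 0.80, 0.60, 0.10, 0.20, 0.05]

freq1, freq2, n_classified = allele_freqs_by_group(alleles, prob_group1)
```

A marked difference between the two returned frequency dictionaries at a marker is the kind of pattern reported here for locus 17.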

Finally, we wanted to check whether the estimated stratification of haplotypes (in the two groups) is connected to any of the known stratifiers, namely the mutation subsets presented in Table 3 of Kerem et al. (1989). There, the CF haplotypes are identified according to whether they carry the ΔF508 or other mutations and, within each of these two mutation groups, according to pancreatic insufficiency (PI) or pancreatic sufficiency (PS). The inspections were done by monitoring the posterior estimated number of patient haplotypes that were classified to group 1 and that also had the non-ΔF508 mutation, the posterior estimated number of haplotypes classified to group 1, and so on. (This monitoring was based on a separate MCMC run.) On the basis of the posterior estimates we observed that the frequency of CF haplotypes containing the non-ΔF508 mutation showed clear enrichment in group 1 (the smaller group with locus 2 as the main QTL) while the frequency of haplotypes carrying the ΔF508 mutation showed no difference. From this we may suggest that if there is epistatic interaction between loci 17 and 2, it is likely to occur in haplotypes with the non-ΔF508 mutation on the CF chromosome.

Type 2 diabetes mellitus and model:

These data were first published in a positional cloning study of Horikawa et al. (2000) and a subset of data has been reanalyzed by Zöllner and Pritchard (2005). We use the same subset of data as Zöllner and Pritchard (2005), which have a binary phenotype (108 cases and 112 controls) and 85 SNP markers (with physical distances) spanning the 876-kb area in the NIDDM1 region on chromosome 2. However, there is one important difference between our approaches: before the actual association analysis, Zöllner and Pritchard (2005) completed missing alleles with their most likely values and estimated haplotypes using the PHASE program whereas we use raw genotype data with missing values directly in our analysis. We use the model where every individual contributes a single observation (phenotype and two alleles at each locus) to the analysis. We use a random variance model for genetic effects and the prior for missing values where each allele is considered to be a priori equally probable at each marker locus.

The parameters were estimated using two different values of shrinkage parameter: Inline graphic (shrinkage) and Inline graphic (no shrinkage) in WinBUGS 1.3 (Gilks et al. 1994; Spiegelhalter et al. 1999), using a Pentium 4, 3.40 GHz. Two parallel MCMC chains, each of length 5500 (shrinkage) or 3750 (no shrinkage), were run with random initial values, which took ∼25 sec per iteration (with two parallel chains). With burn in of 500 (shrinkage) and 2500 (no shrinkage) iterations per chain and no thinning, the estimations were based on 10,000 (shrinkage) and 2500 (no shrinkage) pooled MCMC samples in total. Again, no problems with label switching or convergence were detected.
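The sample counts and the approximate run time quoted above follow from simple arithmetic, which the Python sketch below spells out. The function name `pooled_samples` is hypothetical; the chain lengths, burn-in values, and the figure of about 25 sec per iteration are taken from the text.

```python
def pooled_samples(n_chains, chain_length, burn_in, thin=1):
    """Number of retained draws when chains are pooled after burn in."""
    if not 0 <= burn_in <= chain_length:
        raise ValueError("burn_in must lie between 0 and chain_length")
    return n_chains * ((chain_length - burn_in) // thin)

shrinkage_run = pooled_samples(n_chains=2, chain_length=5500, burn_in=500)
no_shrinkage_run = pooled_samples(n_chains=2, chain_length=3750, burn_in=2500)

# Approximate wall-clock time of the shrinkage run, assuming a constant
# cost of 25 sec per iteration for the two chains run in parallel.
hours_shrinkage = 25 * 5500 / 3600
```

Under these assumptions the shrinkage run corresponds to roughly 38 hr of computation, which is of the same order as the ∼29 hr reported for the CF analysis.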

Results of diabetes data:

The analysis with shrinkage parameter Inline graphic and marker-independence prior (no distances) ended up with a clear signal at 269 kb in group 1. The estimated proportion of individuals therein is 0.73. To see if any additional positions can be found by relaxing the stringency on the shrinkage parameter, another analysis was executed with the shrinkage parameter Inline graphic. (Note that half of the genes are assumed to be associated a priori under this setting.) As presented in Figure 7, two additional positions at 160 and 161 kb were identified to be active in both groups. One can see that there are weak signals also at 269 and 400 kb in group 2. The estimated proportion of individuals in group 1 is 0.62. The estimated position of disease mutation in Zöllner and Pritchard (2005) is at 131 kb and the three SNPs that make up the haplotype at CAPN-10 (Horikawa et al. 2000) are located at 121, 124, and 134 kb. Overall, our estimates are somewhat to the right of their estimates. Zöllner and Pritchard (2005) emphasized that the presence of other genes cannot be excluded because their signal was only modestly significant.

Figure 7.—

The point estimates (posterior mean) of weighted genetic variances in two groups for the diabetes data with the model having marker-independence prior and without shrinkage (Inline graphic). Positions are shown on the x-axis and weighted genetic variances (expressed in logit scale) on the y-axis.

Finally, the complementary analysis was executed with the model using the marker-dependence prior (with distances). This analysis seemed to confirm the locations that were already found in the marker-independence prior analysis (details and results not shown).

DISCUSSION

Presence of genetic heterogeneity:

Generally the most efficient ways to reduce genetic heterogeneity of the trait in the sample are applicable only in the design stage of the study. Examples of those practices are: (1) careful phenotype definition (Leboyer et al. 1998; Sillanpää 2002) based on expression profiling (Kraft and Horvath 2003) or proteomics data (Semmes 2004), (2) careful choice of study population (Wright et al. 1999; Peltonen 2000; Shifman and Darvasi 2001), and (3) careful sample collection such as utilization of the known history of the population isolate in ascertainment (Heath et al. 2001). In this article, we emphasize that we have considered the case where such preventative actions cannot be applied or they have not been sufficient in homogenizing the sample. Moreover, one can consider using this method as a test of presence of genetic heterogeneity or as a primary analysis when there is already some prior evidence or a hint about genetic heterogeneity obtained by statistical or biological means: (1) association testing cannot find a signal, (2) findings from earlier studies cannot be replicated or are contradictory, (3) the population consists of individuals from several ethnic backgrounds, or (4) there is multimodality of the phenotype distribution. Note that one can also use this approach to argue if (self-reported or estimated) ethnic background can be used as a known factor giving specification for the etiological groups.

The presented method:

We have presented a new method for studying multilocus association between multiple marker loci and a quantitative or binary trait in the presence of genetic heterogeneity. The method can handle multiallelic marker loci with some degree of missing genotypes. On the basis of tested examples, the amount of missing data should not be too large (roughly not more than 15%). We expect this to be even more critical in binary traits, where phenotype carries less information. This method can be implemented with WinBUGS (Gilks et al. 1994; Spiegelhalter et al. 1999), which automatically calculates data-specific tuning parameters needed in a Metropolis–Hastings random walk (Chib and Greenberg 1995). Additionally, by assuming a priori independence of the indicators and genetic effects (cf. Kuo and Mallick 1998), we can avoid the existence of other difficult tuning parameters like those needed in stochastic search variable selection (George and McCulloch 1993; Yi et al. 2003a). The method allows for use of external covariates such as age, sex, environmental risk factors, gene-expression measurements (see Hoti and Sillanpää 2006), and dominance in the genetic model. However, we do not claim that this approach is robust for bias of population stratification although logistic regression should be a relatively safe approach in the case of binary phenotype (see Setakis et al. 2006). If marker data have been collected from some specific chromosomal interval, one can control for the problem of population stratification (and other confounders) by using the marker-dependence prior of locus indicators to incorporate physical or genetic map distances, similarly as in Sillanpää and Bhattacharjee (2005). In other cases, we propose the use of self-identified ethnicity or structured association to handle the problem of population stratification (Pritchard et al. 2000a,b; Sillanpää et al. 2001; Corander et al. 2003, 2004; Hoggart et al. 2003). 
Note that a self-reported population identity given by a person (or identity estimated by STRUCTURE or BAPS software) is a putative classification covariate that could be handled as an extra “marker locus” in the presented method.

Stratifying factor and label switching:

As an output, our method provides estimates for group-specific genetic models (QTL) and individual assignments to the groups. In the estimated groups and their QTL, one cannot distinguish if the group-specific locus was identified because of genetic heterogeneity or because of interaction between the locus and some unmeasured (or unused) environmental covariate (see Thorton-Wells et al. 2004). One can, however, check this afterward with respect to existing (but unused) covariates that are available, but generally this is an intractable problem. One should keep this in mind when conclusions are drawn from the analysis. Like some other sophisticated population assignment methods in genetics, our method is likely to suffer from the problem of label switching between different MCMC chains (Pritchard et al. 2000b; Stephens 2000). We did not encounter label switching within any single MCMC chain in our test examples. This (no label switching) simplifies interpretation of the results (groups have labels) but it is also an indication of poor mixing of the MCMC sampler, which one should be aware of and monitor carefully. When label switching occurs across chains (but not within a chain), one should calculate posterior estimates on the basis of samples from a single long MCMC chain instead of pooling samples together from several shorter MCMC runs (even if monitoring the convergence and detection of this problem is based on running several MCMC chains). Moreover, one can also apply the conditional analysis for unclear cases to dissect the underlying structure further, as was illustrated in results. As a conclusion, we emphasize that even if label switching within a single MCMC chain happens, one can still obtain valuable information about the existence of genetic heterogeneity from this analysis when compared to the results of single-group analysis. 
That is, if the same set of loci is supported by both groups and this set differs from single-group analysis, one can conclude that it is likely that there is genetic heterogeneity and label switching.
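One simple way to check whether two chains carry mirrored group labels is to compare their group-specific summaries under both possible labelings. The Python sketch below is a generic heuristic offered for illustration, not the procedure used in this article; the function name `align_group_labels` and the toy QTL probabilities are hypothetical.

```python
import numpy as np

def align_group_labels(reference, other):
    """Relabel the two groups of `other` if that brings it closer to `reference`.

    Both inputs are (2, n_markers) arrays of group-specific QTL probabilities
    estimated from separate MCMC chains. Returns the (possibly swapped) array
    and a flag telling whether the labels were swapped.
    """
    reference = np.asarray(reference, dtype=float)
    other = np.asarray(other, dtype=float)
    swapped = other[::-1]  # exchange the rows for group 1 and group 2
    dist_same = np.abs(reference - other).sum()
    dist_swapped = np.abs(reference - swapped).sum()
    if dist_swapped < dist_same:
        return swapped, True
    return other, False

# Toy QTL probabilities (made up) at three markers for two chains.
chain_a = np.array([[1.0, 0.9, 0.1],
                    [0.0, 1.0, 1.0]])
chain_b = np.array([[0.1, 1.0, 0.9],
                    [0.9, 0.95, 0.2]])

aligned_b, switched = align_group_labels(chain_a, chain_b)
```

A flag of `True` indicates that the chains disagree only in their labels; whether the relabeled chains can then be pooled still depends on the convergence checks discussed above.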

Computation:

As was illustrated in the examples, our approach requires a substantial amount of computation time. Therefore, the current WinBUGS implementation may not directly scale to extremely large data sets (with thousands of individuals) or large marker sets (with several hundreds of markers). Thus, for large-scale data sets: (1) the size of the problem can be reduced by focusing on a subgroup of genes (e.g., on the basis of known pathways, specific genomic regions, or preliminary screening or dimension-reduction techniques), (2) one can pursue a faster implementation of the method in the C language, and/or (3) one can concentrate on estimating only the mode (maximum a posteriori, MAP) of the posterior distribution. However, an extra level of difficulty in methods 2 and 3 above (avoided here by WinBUGS) is to find a good updating scheme for the MCMC sampler. Finally, it is also possible in a Bayesian framework to first divide individuals in the data into smaller data sets and analyze them separately. Because the posterior distribution (results) from one analysis can always be taken as a prior distribution for the next analysis, it is possible to combine separate analyses afterward and still produce a meaningful summary. However, we do not recommend such a “split and combine” approach here, because of the weak identifiability of the groups and potential labeling problems.

Differences between methods:

Because the proposed method can be seen as an extension of the earlier work (Sillanpää et al. 2001; Kilpikari and Sillanpää 2003; Sillanpää and Bhattacharjee 2005) we briefly comment on the differences between these methods:

  1. Population assignment: Sillanpää et al. (2001) also considered a population assignment as here, but a set of neutral markers was included as an extra information source (which is useful for correcting population stratification but not necessarily for genetic heterogeneity). Kilpikari and Sillanpää (2003) and Sillanpää and Bhattacharjee (2005) considered only a case of single-population analysis.

  2. Locus indicators and a random variance model: Sillanpää et al. (2001) and Kilpikari and Sillanpää (2003) employed the reversible-jump MCMC for candidate (QTL) selection and a fixed-variance model (fixed-effect model) for QTL effects, whereas Sillanpää and Bhattacharjee (2005) as well as the approach here utilized locus-indicator variables for genetic model selection and a random-variance model (variance-component model) for QTL effects. However, the reversible-jump MCMCs in these approaches were implemented in a way that resembles locus-indicator models (see Kilpikari and Sillanpää 2003).

  3. Two groups share a common variance: The special property of the model presented here is that the group-specific effects (at each locus) are assumed to be exchangeable; that is, they share a distribution with the common variance component. In the case that the QTL (marker) is active in both groups, all observations contribute to the estimation of the corresponding genetic variance. This means that in the homogeneous case, where the same sets of QTL are simultaneously present in two groups, the approach can maintain the power comparable to that of the single-group analysis. Sillanpää et al. (2001) considered an individual parameter set for each group, which is not optimal in the homogeneous case.

Number of groups:

In the analysis, we have assumed that the number of etiology groups is two and that the residual variance (precision) parameter is common for both groups. If the number of true underlying groups is one this is not a problem (see above paragraph). However, the method can also be generalized for a larger (more than two) number of groups if there is some a priori information about the groups and identifiability is not a problem. Nevertheless, the number of distinct modes of the phenotype distribution should not be directly used as a number of groups (cf. Figure 2). For a large number of groups, it may still be best to assume only two groups but use group-specific residual variances. This kind of model would follow the spirit of an admixture model, where one tries to fit a common disease model only for the homogeneous part of the data and puts the rest of the data into a “junk group,” which can include phenocopies and all kinds of individuals from multiple genetic backgrounds.

Epistasis:

We share the view (Sillanpää 2002; Moore 2003; Sillanpää and Auranen 2004; Thorton-Wells et al. 2004) that the dissection of complex traits requires novel methods that are capable of handling both the genetic heterogeneity and complex genetic architecture. The current approach does use multilocus estimation and can account for genetic heterogeneity but does not explicitly model epistatic interactions. Although it may be easy to include known pairs of epistatic markers in the model (e.g., based on database information on known pathways; see Thomas 2005), detection of epistasis is generally a difficult task that requires a suitable combination of the following: specific designs, large data sets, appropriate statistical modeling, and high computing power. In spite of these difficulties, an extension of the presented approach that includes epistasis and utilizes the modern ideas of stochastic partitioning, where genotypes from multiple loci are stochastically pooled into a smaller number of classes (see Moore 2003; Sillanpää and Auranen 2004), would be worth considering in the future. For current Bayesian interaction approaches, see Conti et al. (2003), Yi and Xu (2002), Yi et al. (2003b, 2005), Narita and Sasaki (2004), and Zhang and Xu (2005).

The model specification code (written in WinBUGS) is freely available for research purposes at http://www.rni.helsinki.fi/∼mjs/.

Acknowledgments

We are grateful to Nancy Cox, Jonathan Pritchard, and Sebastian Zöllner for their help with the diabetes data; to Andrew Morris, Kung-Yee Liang, and Yen-Feng Chiu for their help with the CF data; and to Andrew Thomas, Duncan C. Thomas, and two anonymous referees for their constructive comments on the manuscript. This work was supported by research grant no. 202324 from the Academy of Finland.

References

  1. Balding, D., A. D. Carothers, Y. L. Marchini, L. R. Cardon, A. Vetta et al., 2002. Discussion on the meeting on statistical modelling and analysis of genetic data. J. R. Stat. Soc. B 64: 737–775.
  2. Broman, K. W., and T. P. Speed, 2002. A model selection approach for identification of quantitative trait loci in experimental crosses. J. R. Stat. Soc. B 64: 641–656.
  3. Bull, S. B., L. Mirea, L. Briollais and A. G. Logan, 2003. Heterogeneity in IBD allele sharing among covariate-defined subgroups: issues and findings for affected relatives. Hum. Hered. 56: 94–106.
  4. Burchard, E. G., Z. Elad, N. Coyle, S. L. Gomez, H. Tang et al., 2003. The importance of race and ethnic background in biomedical research and clinical practice. N. Engl. J. Med. 348: 1170–1175.
  5. Cardon, L. R., and L. J. Palmer, 2003. Population stratification and spurious allelic association. Lancet 361: 598–604.
  6. Celeux, G., M. Hurn and C. P. Robert, 2000. Computational and inferential difficulties with mixture posterior distributions. J. Am. Stat. Assoc. 95: 957–970.
  7. Chib, S., and E. Greenberg, 1995. Understanding the Metropolis-Hastings algorithm. Am. Stat. 49: 327–335.
  8. Clayton, D., 2001. Population association, pp. 519–540 in Handbook of Statistical Genetics, edited by D. J. Balding, M. Bishop and C. Cannings. Wiley, Chichester, UK.
  9. Conti, D. V., and J. S. Witte, 2003. Hierarchical modeling of linkage disequilibrium: genetic structure and spatial relations. Am. J. Hum. Genet. 72: 351–363.
  10. Conti, D. V., V. Cortessis, J. Molitor and D. C. Thomas, 2003. Bayesian modeling of complex metabolic pathways. Hum. Hered. 56: 83–93.
  11. Cooper, R. S., J. S. Kaufman and R. Ward, 2003. Race and genomics. N. Engl. J. Med. 348: 1166–1170.
  12. Corander, J., P. Waldmann and M. J. Sillanpää, 2003. Bayesian analysis of genetic differentiation between populations. Genetics 163: 367–374.
  13. Corander, J., P. Waldmann, P. Marttinen and M. J. Sillanpää, 2004. BAPS 2: enhanced possibilities for the analysis of the genetic population structure. Bioinformatics 20: 2363–2369.
  14. Devlin, B., and K. Roeder, 1999. Genomic control for association studies. Biometrics 55: 997–1004.
  15. Devlin, B., K. Roeder and L. Wasserman, 2003. Analysis of multilocus models of association. Genet. Epidemiol. 25: 36–47.
  16. Durrant, C., K. T. Zondervan, L. R. Cardon, S. Hunt, P. Deloukas et al., 2004. Linkage disequilibrium mapping via cladistic analysis of single-nucleotide polymorphism haplotypes. Am. J. Hum. Genet. 75: 35–43.
  17. Ekstrom, C. T., and P. Dalgaard, 2003. Linkage analysis of quantitative trait loci in the presence of heterogeneity. Hum. Hered. 55: 16–26.
  18. Epstein, M. P., C. D. Veal, R. C. Trembath, J. N. W. N. Barker, C. Li et al., 2005. Genetic association analysis using data from triads and unrelated subjects. Am. J. Hum. Genet. 76: 592–608.
  19. Ewens, W. J., and R. S. Spielman, 1995. The transmission/disequilibrium test: history, subdivision, and admixture. Am. J. Hum. Genet. 57: 445–464.
  20. Flint, J., and R. Mott, 2001. Finding the molecular basis of quantitative traits: successes and pitfalls. Nat. Rev. Genet. 2: 437–445.
  21. Foster, M. W., and R. R. Sharp, 2004. Beyond race: towards a whole-genome perspective on human populations and genetic variation. Nat. Rev. Genet. 5: 790–796.
  22. Gasbarra, D., M. J. Sillanpää and E. Arjas, 2005. Backward simulation of ancestors of sampled individuals. Theor. Popul. Biol. 67: 75–83.
  23. George, E. I., and R. E. McCulloch, 1993. Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 88: 881–889.
  24. Gilks, W. R., A. Thomas and D. J. Spiegelhalter, 1994. A language and program for complex Bayesian modeling. Statistician 43: 169–178.
  25. Grigull, J., R. Alexandrova and A. D. Paterson, 2001. Clustering of pedigrees using marker allele frequencies: impact on linkage analysis. Genet. Epidemiol. 21(Suppl. 1): S61–S66.
  26. Hauser, E. R., R. M. Watanabe, W. L. Duren, M. P. Bass, C. D. Langefeld et al., 2004. Ordered subset analysis in genetic linkage mapping of complex traits. Genet. Epidemiol. 27: 53–63.
  27. Heath, S., R. Robledo, W. Beggs, G. Feola, C. Parodo et al., 2001. A novel approach to search for identity by descent in small samples of patients and controls from the same Mendelian breeding unit: a pilot study on Myopia. Hum. Hered. 52: 183–190.
  28. Hinds, D. A., R. P. Stokowski, N. Patil, K. Konvicka, D. Kershenobich et al., 2004. Matching strategies for genetic association studies in structured populations. Am. J. Hum. Genet. 74: 317–325.
  29. Hodge, S. E., V. J. Vieland and D. A. Greenberg, 2002. HLODs remain powerful tools for detection of linkage in the presence of genetic heterogeneity. Am. J. Hum. Genet. 70: 556–558.
  30. Hoggart, C. J., E. J. Parra, M. D. Shriver, C. Bonilla, R. A. Kittles et al., 2003. Control of confounding in genetic associations in stratified populations. Am. J. Hum. Genet. 72: 1492–1504. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Horikawa, Y., N. Oda, N. Cox, X. Li, M. Orho-Melander et al., 2000. Genetic variation in the gene encoding calpain-10 is associated with type 2 diabetes mellitus. Nat. Genet. 26: 163–175. [DOI] [PubMed] [Google Scholar]
  32. Hoti, F., and M. J. Sillanpää, 2006. Bayesian mapping of genotype × expression interactions in quantitative and qualitative traits. Heredity 97: 4–18. [DOI] [PubMed] [Google Scholar]
  33. Hoti, F., A. Tuulio-Hendriksson, J. Haukka, T. Partonen, L. Holmström et al., 2004. Family-based clusters of cognitive test performance in familial schizophrenia. BMC Psychiatry 4: 20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. International HapMap Consortium, 2003. The international HapMap project. Nature 426: 789–796. [DOI] [PubMed] [Google Scholar]
  35. International HapMap Consortium, 2005. A haplotype map of the human genome. Nature 437: 1299–1320. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Kass, R. E., B. P. Carlin, A. Gelman and R. M. Neal, 1998. Markov chain Monte Carlo in practice: a roundtable discussion. Am. Stat. 52: 93–100. [Google Scholar]
  37. Kazeem, G. R., and M. Farrall, 2005. Integrating case-control and TDT studies. Ann. Hum. Genet. 69: 329–335. [DOI] [PubMed] [Google Scholar]
  38. Kerem, B.-S., J. M. Rommens, J. A. Buchanan, D. Markiewicz, T. K. Cox et al., 1989. Identification of the cystic fibrosis gene: genetic analysis. Science 245: 1073–1080. [DOI] [PubMed] [Google Scholar]
  39. Kilpikari, R., and M. J. Sillanpää, 2003. Bayesian analysis of multilocus association in quantitative and qualitative traits. Genet. Epidemiol. 25: 122–135. [DOI] [PubMed] [Google Scholar]
  40. Knapp, M., and T. Becker, 2003. Family-based association analysis with tightly linked markers. Hum. Hered. 56: 2–9. [DOI] [PubMed] [Google Scholar]
  41. Kraft, P., and S. Horvath, 2003. The genetics of gene expression and gene mapping. Trends Biotechnol. 21: 377–378. [DOI] [PubMed] [Google Scholar]
  42. Kuo, L., and B. Mallick, 1998. Variable selection for regression models. Sankhya Ser. B 60: 65–81. [Google Scholar]
  43. Lander, E. S., and N. J. Schork, 1994. Genetic dissection of complex traits. Science 265: 2037–2048. [DOI] [PubMed] [Google Scholar]
  44. Lazzeroni, L. C., 1998. Linkage disequilibrium and gene mapping: an empirical least-squares approach. Am. J. Hum. Genet. 62: 159–170. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Leal, S. M., 1997. Tests for detecting linkage and linkage heterogeneity, pp. 97–112 in Genetic Mapping of Disease Genes, edited by I. H. Pawlowitzki, J. H. Edwards and E. A. Thompson.. Academic Press, San Diego.
  46. Leal, S. M., and J. Ott, 2000. Effects of stratification in the analysis of affected-sib-pair data: benefits and costs. Am. J. Hum. Genet. 66: 567–575. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Leboyer, M., F. Bellivier, M. Nosten-Bertrand, R. Jouvent, P. David et al., 1998. Psychiatric genetics: search for phenotypes. Trends Neurosci. 21: 102–105. [DOI] [PubMed] [Google Scholar]
  48. Lin, Z., and R. B. Altman, 2004. Finding haplotype tagging SNPs by use of principal component analysis. Am. J. Hum. Genet. 75: 850–861. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Lohmueller, K. E., C. L. Pearce, M. Pike, E. S. Lander and J. N. Hirschhorn, 2003. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common diseases. Nat. Genet. 33: 177–182.
  50. Longmate, J. A., 2001. Complexity and power in case-control association studies. Am. J. Hum. Genet. 68: 1229–1237.
  51. Marchini, J., L. R. Cardon, M. S. Phillips and P. Donnelly, 2004. The effects of human population structure on large genetic association studies. Nat. Genet. 36: 512–517.
  52. Meng, Z., D. V. Zaykin, C.-F. Xu, M. Wagner and M. G. Ehm, 2003. Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am. J. Hum. Genet. 73: 115–130.
  53. Molitor, J., P. Marjoram and D. Thomas, 2003a. Fine-scale mapping of disease genes with multiple mutations via spatial clustering techniques. Am. J. Hum. Genet. 73: 1368–1384.
  54. Molitor, J., P. Marjoram and D. Thomas, 2003b. Application of Bayesian spatial statistical methods to analysis of haplotypes effects and gene mapping. Genet. Epidemiol. 25: 95–105.
  55. Moore, J. H., 2003. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum. Hered. 56: 73–82.
  56. Morris, A., J. C. Whittaker and D. J. Balding, 2000. Bayesian fine-scale mapping of disease loci, by hidden Markov models. Am. J. Hum. Genet. 67: 155–169.
  57. Morris, A. P., J. C. Whittaker and D. J. Balding, 2002. Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. Am. J. Hum. Genet. 70: 686–707.
  58. Morris, A. P., J. C. Whittaker, C.-F. Xu, L. K. Hosking and D. J. Balding, 2003. Multipoint linkage-disequilibrium mapping narrows location interval and identifies mutation heterogeneity. Proc. Natl. Acad. Sci. USA 100: 13442–13446.
  59. Narita, A., and Y. Sasaki, 2004. Detection of multiple QTL with epistatic effects under a mixed inheritance model in an outbred population. Genet. Sel. Evol. 36: 415–433.
  60. Peltonen, L., 2000. Positional cloning of disease genes: advantages of genetic isolates. Hum. Hered. 50: 66–75.
  61. Pritchard, J. K., M. Stephens, N. A. Rosenberg and P. Donnelly, 2000a. Association mapping in structured populations. Am. J. Hum. Genet. 67: 170–181.
  62. Pritchard, J. K., M. Stephens and P. Donnelly, 2000b. Inference of population structure using multilocus genotype data. Genetics 155: 945–959.
  63. Province, M. A., W. D. Shannon and D. C. Rao, 2001. Classification methods for confronting heterogeneity. Adv. Genet. 42: 273–286.
  64. Rebbeck, T. R., M. E. Martinez, T. A. Sellers, P. G. Shields, C. P. Wild et al., 2004. Genetic variation and cancer: improving the environment for publication of association studies. Cancer Epidemiol. Biomarkers Prev. 13: 1985–1986.
  65. Richardson, S., and P. J. Green, 1997. On Bayesian analysis of mixtures with an unknown number of components. J. R. Stat. Soc. B 59: 731–792.
  66. Risch, N., and K. Merikangas, 1996. The future of genetic studies of complex human diseases. Science 273: 1516–1517.
  67. Sasieni, P. D., 1997. From genotypes to genes: doubling the sample size. Biometrics 53: 1253–1261.
  68. Schaid, D. J., S. K. McDonnell and S. N. Thibodeau, 2001. Regression models for linkage heterogeneity applied to familial prostate cancer. Am. J. Hum. Genet. 68: 1189–1196.
  69. Schork, N. J., D. Fallin, B. Thiel, X. Xu, U. Broeckel et al., 2001. The future of genetic case/control studies. Adv. Genet. 42: 191–211.
  70. Seaman, S. R., S. Richardson, I. Stücker and S. Benhamou, 2002. A Bayesian partition model for case-control studies on highly polymorphic candidate genes. Genet. Epidemiol. 22: 356–368.
  71. Semmes, O. J., 2004. Defining the role of mass spectrometry in cancer diagnostics. Cancer Epidemiol. Biomarkers Prev. 13: 1555–1557.
  72. Setakis, E., H. Stirnadel and D. J. Balding, 2006. Logistic regression protects against population structure in genetic association studies. Genome Res. 16: 290–296.
  73. Shannon, W. D., M. A. Province and D. C. Rao, 2001. Tree-based recursive partitioning methods for subdividing sibpairs into relatively more homogeneous subgroups. Genet. Epidemiol. 20: 293–306.
  74. Shifman, S., and A. Darvasi, 2001. The value of isolated populations. Nat. Genet. 28: 309–310.
  75. Sillanpää, M. J., 2002. Mathematics-assisted mapping in analysis of medical disease. Ann. Med. 34: 291–298.
  76. Sillanpää, M. J., and K. Auranen, 2004. Replication in genetic studies of complex traits. Ann. Hum. Genet. 68: 646–657.
  77. Sillanpää, M. J., and M. Bhattacharjee, 2005. Bayesian association-based fine mapping in small chromosomal segments. Genetics 169: 427–439.
  78. Sillanpää, M. J., and J. Corander, 2002. Model choice in gene mapping: what and why. Trends Genet. 18: 301–307.
  79. Sillanpää, M. J., R. Kilpikari, S. Ripatti, P. Onkamo and P. Uimari, 2001. Bayesian association mapping for quantitative traits in a mixture of two populations. Genet. Epidemiol. 21(Suppl. 1): S692–S699.
  80. Smith, C. A. B., 1963. Testing for heterogeneity of recombination values in human genetics. Ann. Hum. Genet. 27: 175–182.
  81. Stephens, M., 2000. Dealing with label switching in mixture models. J. R. Stat. Soc. B 62: 795–809.
  82. Spiegelhalter, D. J., A. Thomas and N. G. Best, 1999. WinBUGS Version 1.2 User Manual. MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK.
  83. Terwilliger, J. D., and K. M. Weiss, 1998. Linkage disequilibrium mapping of complex disease: Fantasy or reality? Curr. Opin. Biotechnol. 9: 578–594.
  84. Thomas, D. C., 2005. The need for a systematic approach to complex pathways in molecular epidemiology. Cancer Epidemiol. Biomarkers Prev. 14: 557–559.
  85. Thomas, D. C., J. L. Morrison and D. G. Clayton, 2001. Bayes estimates of haplotype effects. Genet. Epidemiol. 21(Suppl. 1): S712–S717.
  86. Thomson, G., 1995. Mapping disease genes: family-based association studies. Am. J. Hum. Genet. 57: 487–498.
  87. Thorton-Wells, T. A., J. H. Moore and J. L. Haines, 2004. Genetics, statistics and human disease: analytical retooling for complexity. Trends Genet. 20: 640–647.
  88. Uimari, P., and I. Hoeschele, 1997. Mapping linked quantitative trait loci using Bayesian analysis and Markov chain Monte Carlo algorithms. Genetics 146: 735–743.
  89. Uimari, P., and M. J. Sillanpää, 2001. Bayesian oligogenic analysis of quantitative and qualitative traits in general pedigrees. Genet. Epidemiol. 21: 224–242.
  90. Whittemore, A. S., and J. Halpern, 2001. Problems in the definition, interpretation, and evaluation of genetic heterogeneity. Am. J. Hum. Genet. 68: 457–465.
  91. Wright, A. F., A. D. Carothers and M. Pirastu, 1999. Population choice in mapping genes for complex diseases. Nat. Genet. 23: 397–404.
  92. Yi, N., 2004. A unified Markov chain Monte Carlo framework for mapping multiple quantitative trait loci. Genetics 167: 967–975.
  93. Yi, N., and S. Xu, 2002. Mapping quantitative trait loci with epistatic effects. Genet. Res. 79: 185–198.
  94. Yi, N., V. George and D. B. Allison, 2003a. Stochastic search variable selection for identifying multiple quantitative trait loci. Genetics 164: 1129–1138.
  95. Yi, N., S. Xu and D. B. Allison, 2003b. Bayesian model choice and search strategies for mapping interacting quantitative trait loci. Genetics 165: 867–883.
  96. Yi, N., B. S. Yandell, G. A. Churchill, D. B. Allison, E. J. Eisen et al., 2005. Bayesian model selection for genome-wide epistatic QTL analysis. Genetics 170: 1333–1344.
  97. Yu, K., R. B. Martin and A. S. Whittemore, 2004a. Classifying disease chromosomes arising from multiple founders, with application to fine-scale haplotype mapping. Genet. Epidemiol. 27: 173–181.
  98. Yu, K., C. C. Gu, M. Province, C. J. Xiong and D. C. Rao, 2004b. Genetic association mapping under founder heterogeneity via weighted haplotype similarity analysis in candidate genes. Genet. Epidemiol. 27: 182–191.
  99. Zhang, Y.-M., and S. Xu, 2005. A penalized maximum likelihood method for estimating epistatic effects of QTL. Heredity 95: 96–104.
  100. Zöllner, S., and J. K. Pritchard, 2005. Coalescent-based association mapping and fine mapping of complex trait loci. Genetics 169: 1071–1092.

Articles from Genetics are provided here courtesy of Oxford University Press
