Genetics. 2007 Oct;177(2):761–771. doi: 10.1534/genetics.107.071407

A Statistical Framework for Expression Quantitative Trait Loci Mapping

Meng Chen *, Christina Kendziorski †,1
PMCID: PMC2034641  PMID: 17660576

Abstract

In 2001, Sen and Churchill reported a general Bayesian framework for quantitative trait loci (QTL) mapping in inbred line crosses. The framework is a powerful one, as many QTL mapping methods can be represented as special cases and many important considerations are accommodated. These considerations include accounting for covariates, nonstandard crosses, missing genotypes, genotyping errors, multiple interacting QTL, and nonnormal as well as multivariate phenotypes. The dimension of a multivariate phenotype easily handled within the framework is bounded by the number of subjects, as a full-rank covariance matrix describing correlations across the phenotypes is required. We address this limitation and extend the Sen–Churchill framework to accommodate expression quantitative trait loci (eQTL) mapping studies, where high-dimensional gene-expression phenotypes are obtained via microarrays. Doing so allows for the precise comparison of existing eQTL mapping approaches and facilitates the development of an eQTL interval-mapping approach that shares information across transcripts and improves localization of eQTL. Evaluations are based on simulation studies and a study of diabetes in mice.


THE quantitative trait loci (QTL) mapping framework developed by Sen and Churchill (2001), referred to hereinafter as the Sen–Churchill framework, unifies many methods for QTL mapping in inbred line crosses. The seminal work of Lander and Botstein (1989) and subsequent methods including Haley–Knott regression (1992), composite interval mapping, and multiple QTL mapping (Jansen 1993; Zeng 1993, 1994; Jansen and Stam 1994), are all represented, at least approximately, as special cases of the framework. The framework also accounts for covariates, nonstandard cross designs, missing genotype data, genotyping errors, multiple interacting QTL, and nonnormal as well as multivariate phenotypes. As a result, it provides a powerful approach to localize the genetic basis of quantitative traits.

There has been much interest recently in identifying the genetic basis of thousands of gene-expression traits measured via microarrays (Brem et al. 2002; Schadt et al. 2003; Yvert et al. 2003; Cox 2004). The multi-trait version of the Sen–Churchill framework is based on the multivariate normal distribution. This approach becomes problematic when the number of traits is larger than the number of subjects, as the estimated covariance matrix will have less than full rank. To address this, we here extend the Sen–Churchill framework to accommodate expression phenotypes. We first highlight aspects of the Sen–Churchill framework important to our development, and then detail the extension. We show that the extended framework generalizes the currently available expression QTL (eQTL) mapping methods and facilitates the development of an approach that allows for both interval mapping of eQTL and information sharing across transcripts. Evaluations are based on simulation studies and a study of diabetes in mice. Generalizations of the framework are also discussed. Many of the technical details can be found in the appendixes.

A FRAMEWORK FOR EXPRESSION QTL INFERENCE

The Sen–Churchill framework:

The Sen–Churchill framework supports a Bayesian approach to QTL mapping that accommodates a variety of phenotypes and data structures. Much of the flexibility of the approach is due to two main features. The first is marked separation of the genetic model, which relates phenotype to genotype, and the linkage model, which relates putative QTL genotype to the marker map. The second feature is that computation relies on an efficient Monte Carlo component instead of a more complex MCMC procedure as employed in a number of other Bayesian QTL methods (Satagopan et al. 1996; Yi and Xu 2000; Yi 2004). As we discuss in detail below, these two features allow for accommodation of microarray data as a phenotype within the framework. We here provide an overview of the framework, focusing on aspects important to our extension.

Suppose that quantitative traits are measured for n members of an inbred line cross. Denote the traits by y = (y1,…, yn) and denote the corresponding marker data by the n × M matrix m, where M denotes the total number of markers. Marker location and genetic distances are assumed known, although in practice these are estimated. A genetic model H describes the way in which QTL genotypes determine a phenotype; it is prescribed by the number of QTL, their locations, and the way in which they act and interact to affect the phenotype. Assuming p QTL in a genetic model, let γ denote the p-dimensional vector of QTL locations and g denote the n × p matrix of QTL genotypes. The parameters of the genetic model are denoted by μ.

Of primary interest is the posterior distribution of QTL location, p(γ | y, m), given by

$$p(\gamma \mid y, m) \propto p(\gamma) \int p(y \mid g, \gamma)\, p(g \mid m)\, dg \qquad (1)$$

where modes of p(γ | y, m) estimate QTL position. An exact evaluation of Equation 1 is computationally prohibitive, but an approximation can be obtained by sampling multiple versions of the putative QTL genotypes g and averaging as follows:

  1. Select a regularly spaced grid G of pseudomarker locations, locations for which genotypes are not known, and create q realizations of the pseudomarkers by sampling from p(g | m). Assuming known genetic distances and no crossover interference, a Markov chain sampling scheme can be used. Each realization of pseudomarker genotypes is an n × G matrix.

  2. For the assumed genetic model H, a p-dimensional vector of pseudomarker locations corresponding to the QTL, γH, is prescribed; and the ith realization of pseudomarker genotypes provides gi(γH), an n × p matrix of pseudomarker genotypes at the QTL locations.

  3. For each realization, calculate a weight under the assumed genetic model H. The weight for the ith realization is
    $$w_i(\gamma_H) = p(y \mid g_i(\gamma_H), H)$$
  4. An average over q of these weights approximates (1), according to the principle of importance sampling
    $$p(\gamma_H \mid y, m) \approx C \cdot \frac{1}{q} \sum_{i=1}^{q} w_i(\gamma_H)$$
    for some constant of proportionality C.
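
The four steps above can be illustrated with a minimal Python sketch for the simplest case: a backcross, a single pseudomarker between two fully genotyped flanking markers, and one putative QTL at that pseudomarker. Several pieces are assumptions made for illustration and are not taken from the article: the Haldane map function (which corresponds to no crossover interference), a Gaussian likelihood with known standard deviation and plug-in group means standing in for the weight, and all function names.

```python
import math
import random

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def prob_one(left, right, d_left, d_right):
    """P(pseudomarker genotype = 1 | flanking backcross marker genotypes)."""
    r1, r2 = haldane_r(d_left), haldane_r(d_right)

    def trans(a, b, r):
        # Probability of moving from genotype a to genotype b across one interval
        return 1.0 - r if a == b else r

    w1 = trans(left, 1, r1) * trans(1, right, r2)
    w0 = trans(left, 0, r1) * trans(0, right, r2)
    return w1 / (w0 + w1)

def gaussian_loglik(y, g, sigma=1.0):
    """Gaussian log-likelihood with one plug-in mean per genotype group (a simplification)."""
    ll = 0.0
    for k in (0, 1):
        grp = [yi for yi, gi in zip(y, g) if gi == k]
        if grp:
            mu = sum(grp) / len(grp)
            ll += sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
                      - (yi - mu) ** 2 / (2 * sigma ** 2) for yi in grp)
    return ll

def averaged_weight(y, left, right, d_left, d_right, q=200, seed=0):
    """Average the weights of q pseudomarker realizations drawn from p(g | m)."""
    rng = random.Random(seed)
    p1 = [prob_one(a, b, d_left, d_right) for a, b in zip(left, right)]
    weights = []
    for _ in range(q):
        g = [1 if rng.random() < p else 0 for p in p1]   # one realization
        weights.append(math.exp(gaussian_loglik(y, g)))  # its weight
    return sum(weights) / q
```

Repeating `averaged_weight` over a grid of positions gives an unnormalized profile whose modes point to candidate QTL locations. For larger n the weights should be handled on the log scale to avoid underflow.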

Extensions to eQTL mapping:

Consider for simplicity a backcross population genotyped as aa(0) or Aa(1) at M markers (this simplification to a backcross is not required and is relaxed in the Applications to data from a study of diabetes section). For eQTL mapping, the observed phenotype data y are no longer a vector as above, but rather a T × n matrix of expression levels. Specifically, y = (y1, y2,…, yT)′, where vector yt = (yt1,…, ytn) denotes the (possibly transformed) expression levels for transcript t measured in n animals. As in the univariate phenotype case, m denotes an n × M matrix containing genotypes on M markers.

Of most interest is the identification of significant linkages between transcripts and genome locations. To be precise, a transcript t is linked to location l if $\mu_{t,l}^{0} \neq \mu_{t,l}^{1}$, where $\mu_{t,l}^{0(1)}$ denotes the latent mean level of expression for transcript t for the population of animals with genotype 0(1) at location l. Two T × G matrices, θ0 and θ1, contain the latent mean levels of expression $\mu_{t,l}^{0(1)}$; and, as above, G denotes the total number of locations considered. In the Sen–Churchill framework, of primary interest is the posterior evaluation of γ, a vector of QTL locations. In this context, γ is transcript specific. For example, for transcript t, γt would contain indexes l′ such that $\mu_{t,l'}^{0} \neq \mu_{t,l'}^{1}$.

Single eQTL mapping methods:

Suppose that a transcript is affected by at most one genotype location l (this assumption can be relaxed as discussed later) and consider inference at location l. Of most interest is the posterior probability that transcript t is linked to location l. We show in appendix b that

$$p(\gamma_t = l \mid y, m) = \frac{p(\gamma_t = l) \int f_{P1}(y_t \mid g_l)\, p(g_l \mid m)\, dg_l}{p(y_t \mid m)} \qquad (2)$$

where $f_{P1}(y_t \mid g_l)$ is the marginal density describing the data in the case of linkage to l.

Equation 2 is similar in form to (1), but there are some important differences. In Equation 2, conditioning is done on the full set of transcripts. An assumption of conditional independence across transcripts (see appendixes) yields a right-hand side (RHS) that is evaluated only at the transcript of interest t. The form of fP1 determines whether or not information from other transcripts affects the evaluation. For example, if fP1 is taken to be a univariate Gaussian (or other parametric) distribution, then the RHS is completely determined by the data at t since the parameters of fP1 do not depend on other transcripts. An application of the extended Sen–Churchill framework in this case would consist of a repeated application of a single-transcript analysis to each expression trait in isolation. This has been done in a number of eQTL studies to yield effective results. However, with this approach, there is no information shared across transcripts. As pointed out in a number of articles on microarray data analysis (Newton et al. 2001; Tusher et al. 2001; Kendziorski et al. 2003; Smyth 2004; Cui et al. 2005), information sharing is important to improve sensitivity and moderate test statistics that are otherwise prone to inflated error. Kendziorski et al. (2006) demonstrated this in the context of eQTL studies and proposed the mixture-over-markers (MOM) model to localize eQTL while allowing for information sharing across transcripts. The MOM model is further developed in Gelfond et al. (2007).

The MOM model is represented as a special case of the extended framework when fP1 is taken to be a certain predictive density. In short, assume measurements of transcript t for animal r, denoted ytr, arise as conditionally independent random deviations from an observation distribution fobs(· | μt, θ) with the μt's as random effects described by a distribution π(μ). The model is assumed to be the same across locations and so dependence on l is suppressed. In this model, an equivalently expressed transcript t presents data yt according to the distribution

$$f_{P0}(y_t) = \int \left( \prod_{r=1}^{n} f_{obs}(y_{tr} \mid \mu, \theta) \right) \pi(\mu)\, d\mu \qquad (3)$$

where Inline graphic and $f_{P1}(y_t) = f_{P0}(y_t^{0})\, f_{P0}(y_t^{1})$ describes the data for mapping transcripts, owing to the fact that different mean values, $\mu_t^{0}$ and $\mu_t^{1}$, govern the different subsets $y_t^{0}$ and $y_t^{1}$ of samples and are considered independent draws from π(μ) (see appendixes). Here, $y_t^{0}$ and $y_t^{1}$ denote the collection of expression values from subjects with genotypes aa and Aa, respectively. As detailed in Kendziorski et al. (2006), a Gaussian model is assumed for fobs(·) and π(·). We also allowed for the possibility that different clusters of transcripts could present data with different variances.

Specification of the denominator p(yt | m) of Equation 2 is not required if closed forms for (or good approximations of) parameter estimates are available and estimation of the false discovery rate (FDR) is not of interest. When closed forms are not available and/or calculation of estimated FDR is of interest, p(yt | m) must be evaluated. Note that $p(y_t \mid m) = \sum_{l=0}^{G} p(y_t \mid \gamma_t = l, m)\, p(\gamma_t = l)$, where γt = 0 implies that the transcript does not map to any of the G locations. We do not assume any specific priors on the mixing proportions; they are estimated from the data. As detailed in appendix b, Equation 2 then becomes

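$$p(\gamma_t = l \mid y, m) = \frac{p(\gamma_t = l) \int f_{P1}(y_t \mid g_l)\, p(g_l \mid m)\, dg_l}{p(\gamma_t = 0)\, f_{P0}(y_t) + \sum_{l'=1}^{G} p(\gamma_t = l') \int f_{P1}(y_t \mid g_{l'})\, p(g_{l'} \mid m)\, dg_{l'}} \qquad (4)$$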

Note that conditioning on genotype is dropped if the transcript is not linked to location l as all measurements arise from a distribution with common mean and so genotype information, which prescribes groups in the case of a transcript mapping to l, is not required.

When evaluated at markers only, where genotypes are known, Equation 4 is identical to the MOM model. Extensions of MOM to interval mapping have been difficult to date, as evaluation of Equation 4 can be prohibitive in between markers. Since the lth column of g, denoted gl, is a vector of length n, there are 2^n possible genotype vectors (for a backcross); as a result, the integral in Equation 4 is a very large mixture, even when n is only moderately large. In practice, one could potentially restrict to fewer possibilities since many genotype vectors have very small probabilities. However, as the number of individuals in the study gets large (>200), this quickly becomes computationally infeasible even with the restriction. Fortunately, pseudomarkers can be used, as in the Sen–Churchill framework, to overcome this problem.

In the extended framework, multiple versions of pseudomarkers are sampled from p(gl | m). Suppose for each location l (l = 1,…, G), q genotype vectors are sampled from the proposal distribution p(g | m) to yield ($g_l^{1}$, $g_l^{2}$,…, $g_l^{q}$). Then Equation 4 is approximated by

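$$p(\gamma_t = l \mid y, m) \approx \frac{p(\gamma_t = l)\, \frac{1}{q} \sum_{i=1}^{q} f_{P1}(y_t \mid g_l^{i})}{p(\gamma_t = 0)\, f_{P0}(y_t) + \sum_{l'=1}^{G} p(\gamma_t = l')\, \frac{1}{q} \sum_{i=1}^{q} f_{P1}(y_t \mid g_{l'}^{i})} \qquad (5)$$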

and modes of this distribution are used to estimate eQTL positions. One can apply this approach to grids of varying sizes (i.e., varying G) to localize eQTL at and in between markers. We refer to this approach as pseudomarker MOM (psMOM).
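
The Monte Carlo approximation described above can be sketched for a single transcript as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the observation model is Gaussian with a known variance `sigma2`, the prior on the latent mean is Gaussian with mean `mu0` and variance `tau2`, the mixing proportions are supplied rather than estimated, and every function and parameter name is hypothetical.

```python
import math

def log_fp0(y, mu0=0.0, tau2=1.0, sigma2=1.0):
    """Log marginal density of y when all values share one latent mean.

    Assumes y_r | mu ~ N(mu, sigma2) and mu ~ N(mu0, tau2), so that y is
    multivariate normal with covariance sigma2 * I + tau2 * J.
    """
    n = len(y)
    if n == 0:
        return 0.0  # an empty group contributes a factor of 1
    d = [yi - mu0 for yi in y]
    ss, s = sum(di * di for di in d), sum(d)
    log_det = (n - 1) * math.log(sigma2) + math.log(sigma2 + n * tau2)
    quad = (ss - tau2 * s * s / (sigma2 + n * tau2)) / sigma2
    return -0.5 * (n * math.log(2 * math.pi) + log_det + quad)

def log_fp1(y, g, **kw):
    """Log marginal density under linkage: the two genotype groups get independent means."""
    y0 = [yi for yi, gi in zip(y, g) if gi == 0]
    y1 = [yi for yi, gi in zip(y, g) if gi == 1]
    return log_fp0(y0, **kw) + log_fp0(y1, **kw)

def psmom_posterior(y, draws, mix, **kw):
    """Posterior over {no linkage, location 1, ..., location G} for one transcript.

    draws[l - 1] holds the q sampled genotype vectors at location l;
    mix = [p0, p1, ..., pG] are the mixing proportions (assumed given here).
    """
    terms = [mix[0] * math.exp(log_fp0(y, **kw))]
    for l, realizations in enumerate(draws, start=1):
        avg = sum(math.exp(log_fp1(y, g, **kw)) for g in realizations) / len(realizations)
        terms.append(mix[l] * avg)
    total = sum(terms)
    return [t / total for t in terms]
```

The first element of the returned list is the posterior probability of no linkage; the remaining elements correspond to the G locations, so the list sums to 1 across the null and all test points.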

Simulations:

We conducted a small set of simulations to compare psMOM with traditional interval mapping (IM) applied to each transcript in isolation. The simulations are not designed to capture the many complexities of eQTL data, but rather they provide some preliminary information on operating characteristics in simple settings. Marker genotype data were simulated for four chromosomes, each of length 100 cM and having 11 equally spaced markers (10-cM spacing). We assumed that 15% of all transcripts map to at least one genomic location; 5% map to a single location on chromosome 1 (26 cM); 5% map to two locations on chromosome 2 (44 and 56 cM); the remaining 5% map to two locations on chromosome 3 (22 and 82 cM). No transcripts are affected by alleles on chromosome 4.
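
As an illustration of the marker layout just described, the following sketch simulates backcross genotypes for one chromosome with 11 markers at 10-cM spacing. It assumes the Haldane map function and a first-order Markov chain along the chromosome (no crossover interference); the article does not state which map function was used, and the function names are hypothetical.

```python
import math
import random

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM under the Haldane map function."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def simulate_backcross_chromosome(n_animals, n_markers=11, spacing_cm=10.0, rng=None):
    """Return an n_animals x n_markers list of 0/1 backcross genotypes."""
    rng = rng or random.Random()
    r = haldane_r(spacing_cm)
    genos = []
    for _ in range(n_animals):
        g = [rng.randint(0, 1)]                 # first marker: aa (0) or Aa (1), equally likely
        for _ in range(n_markers - 1):
            prev = g[-1]
            # switch genotype with probability r between adjacent markers
            g.append(1 - prev if rng.random() < r else prev)
        genos.append(g)
    return genos
```

Calling the function four times with 200 animals would give marker data of the same shape as the four simulated chromosomes.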

Backcross data were simulated for 200 animals and 4000 transcripts. Simulated intensities follow the approach described in Kendziorski et al. (2006). Briefly, we assume log intensities are normally distributed, which is consistent with the assumptions of both IM and psMOM. Transcript-specific means and variances are sampled from the empirical means and variances of the F2 cross described previously. The latent means of transcripts mapping to a single location satisfy Inline graphic. For the transcripts mapping to two locations l = (l1, l2), their latent means satisfy Inline graphic. Twenty simulated data sets were generated.

Implementation of IM:

For IM, we consider fP1 as univariate Gaussian with transcript-specific parameters obtained via the method of moments using only data at that transcript; 500 sets of pseudomarkers are generated every 2 cM and LOD scores are computed. To compare with the highest posterior density (HPD) regions of psMOM (see below), likelihood ratios (LRs) are derived from the LOD scores, normalized, and converted to quantities similar to posterior probabilities. For example, if L(H1, l′)/L(H0) denotes the likelihood ratio at location l′, we consider Inline graphic and Inline graphic as evidence of equivalent and differential expression at l′, respectively. We refer to these as LOD posterior probabilities. Transcripts with LOD posterior probability of differential expression exceeding some threshold are considered mapping transcripts. As shown in Tables 1–3, IM is evaluated for varying thresholds.

TABLE 1.

Power averaged over 20 data sets

                Threshold 0.1    Threshold 0.2    Threshold 0.3    Threshold 0.4
Location        psMOM   IM       psMOM   IM       psMOM   IM       psMOM   IM
Chr1            0.978   0.974    0.975   0.961    0.973   0.891    0.970   0.768
Chr2            0.797   0.777    0.795   0.774    0.794   0.747    0.788   0.694
Chr2, eQTL1     0.935   0.930    0.933   0.923    0.932   0.888    0.924   0.822
Chr2, eQTL2     0.859   0.847    0.858   0.843    0.858   0.813    0.849   0.753
Chr3            0.568   0.503    0.562   0.495    0.558   0.458    0.545   0.391
Chr3, eQTL1     0.728   0.743    0.724   0.724    0.717   0.655    0.702   0.549
Chr3, eQTL2     0.827   0.731    0.824   0.711    0.818   0.640    0.795   0.521

Standard errors were <0.01. Linkage thresholds were varied from 0.1 to 0.4. Chr, chromosome.

TABLE 2.

FDR averaged over 20 data sets

                Threshold 0.1    Threshold 0.2    Threshold 0.3    Threshold 0.4
Location        psMOM   IM       psMOM   IM       psMOM   IM       psMOM   IM
Chr1            0.028   0.146    0.018   0.028    0.013   0.009    0.008   0.005
Chr2            0.024   0.152    0.016   0.025    0.009   0.009    0.007   0.002
Chr3            0.034   0.173    0.023   0.066    0.020   0.041    0.016   0.036

Standard errors were <0.01. Linkage thresholds were varied from 0.1 to 0.4. As noted in the Power and false discovery rate calculation section, the FDR estimates shown here do not consider mapping transcripts that map outside the 10-cM window of the eQTL. Considering these transcripts greatly inflated the FDR for IM. Alternative methods for LOD profile normalization did not yield results better than those shown here.

TABLE 3.

Specificity averaged over 20 data sets

                Threshold 0.1    Threshold 0.2    Threshold 0.3    Threshold 0.4
Location        psMOM   IM       psMOM   IM       psMOM   IM       psMOM   IM
Chr1            0.882   0.805    0.885   0.882    0.886   0.901    0.889   0.916
Chr2            0.884   0.806    0.887   0.883    0.889   0.905    0.891   0.922
Chr3            0.883   0.805    0.886   0.881    0.888   0.898    0.889   0.911

Standard errors were <0.01. Linkage thresholds were varied from 0.1 to 0.4.

For some examples (noted in subsequent text), to compare with the HPD regions derived from LOD posterior probabilities, we also considered 1.5-LOD drop-support intervals around peak LOD scores (Mangin et al. 1994; Dupuis and Siegmund 1999). They are designed to target confidence regions of level 95%, but in general, these intervals are known to be biased in that they are too small (Visscher et al. 1996). On the other hand, confidence intervals that are slightly too small favor IM as eQTL appear to be better localized. To give IM the best results, we consider a 10-cM window around the true eQTL positions and define the respective LOD peaks as the highest LODs within the windows. The 1.5-LOD support intervals are then constructed. Of course, in practice, one does not have the luxury of knowing where to choose these peaks and perhaps only the largest peak would be identified. In this way, the results of this approach are further biased in favor of IM.

Implementation of psMOM:

Equation 4 is first evaluated at the genotyped markers and the M′ markers with posterior probabilities ≥0.9 define an HPD region. In particular, the posterior probabilities at the identified M′ markers are averaged across the mapping transcripts and, using this M′ vector, an HPD region is identified. Basically, the HPD region contains the minimum number of support points with corresponding posterior probabilities having a sum exceeding 1 − α. More precisely, the HPD region of level 1 − α is constructed by rank ordering posterior probabilities p(1) ≤ p(2) ≤ … ≤ p(n), where $\sum_{i=1}^{n} p_{(i)} = 1$, and identifying the largest n′ such that $\sum_{i=n'}^{n} p_{(i)} \geq 1 - \alpha$. The HPD region then consists of the support points corresponding to p(n′), p(n′+1),…, p(n).
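
The HPD construction described above amounts to accumulating support points from the most probable downward until the target mass is reached. A minimal sketch follows; the function name is hypothetical, the input is renormalized to sum to 1, and the stopping rule uses "at least 1 − α", which is an assumption about how ties at the boundary are treated.

```python
def hpd_region(probs, alpha=0.05):
    """Indices of the smallest set of support points with total probability >= 1 - alpha."""
    total = sum(probs)
    # Visit support points from largest to smallest probability
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    region, mass = [], 0.0
    for i in order:
        region.append(i)
        mass += probs[i] / total
        if mass >= 1.0 - alpha:
            break
    return sorted(region)
```

For example, with unnormalized values `[1, 1, 10, 5, 3]` and `alpha=0.2`, the cumulative mass reaches 0.5, 0.75, and then 0.9, so the region consists of the three support points at indices 2, 3, and 4.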

A pseudomarker grid is set up within the HPD region and multiple versions of pseudomarkers are generated. The model fit is carried out utilizing all genotyped markers plus pseudomarkers in the HPD regions. As in IM, 500 sets of pseudomarkers are generated every 2 cM; unlike IM, the pseudomarkers are only generated within the HPD region and thus the dimension of G here is reduced, thereby reducing the computational burden. The procedure gives a matrix of posterior probabilities for every transcript and every test point. Transcript t is defined to map to location l if the posterior probability in the second stage exceeds some threshold. As in IM, psMOM is evaluated for varying thresholds.

Choice of threshold:

A list of mapping transcripts with target FDR α can be constructed by taking those with posterior probability of equivalent expression less than α (Newton et al. 2004). This specifies transcripts that likely harbor at least one eQTL, but does not provide information on the total number of eQTL per transcript. For the latter, a linkage threshold must be set. The thresholds evaluated here for both IM and psMOM are varied from 0.1 to 0.4. For example, recall that the LOD posterior probability profiles from IM and the posterior probability profiles from psMOM each sum to 1. When a single eQTL is simulated, often the (LOD) posterior probability of linkage is quite large at the location of the eQTL (e.g., >0.95). However, with two eQTL, individual posterior probabilities are rarely that large since evidence is spread out across multiple locations. Thresholds could also be chosen on the basis of transcript-specific HPD regions (see supplemental material at http://www.genetics.org/supplemental/ for an example).

Power and false discovery rate calculation:

A call is said to be “correct” if the genome location identified is within 5 cM of the true eQTL (i.e., within the 10-cM window centered at the true eQTL location). In the case of two eQTL, at least one location has to be within the 10-cM window of a true eQTL for the identified eQTL to be deemed as correct. Power measures the ability to correctly identify mapping transcripts. It is calculated to be the ratio of the number of correct calls to the total number of eQTL. FDR is calculated as the ratio of the number of incorrect calls to the total number of calls. Incorrect calls consist here of nonmapping transcripts that are identified to map. Our calculation of FDR does not consider mapping transcripts that map outside the 10-cM window of the eQTL, since this led to greatly inflated FDR for IM. Specificity represents the proportion of nonmapping transcripts that are identified as nonmapping.
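
The definitions above can be made concrete with a small sketch for the single-eQTL case. The data layout (dictionaries from transcript name to position in cM, with `None` for no eQTL or no call) and the function name are hypothetical. Following the text, calls from mapping transcripts that fall outside the window are not counted as incorrect; keeping them in the denominator of the FDR is an assumption of this sketch.

```python
def evaluate_calls(truth, calls, window_cm=5.0):
    """Power, FDR, and specificity for single-eQTL calls.

    truth: transcript -> true eQTL position in cM, or None if nonmapping.
    calls: transcript -> called position in cM, or None if no call was made.
    """
    # Correct calls: called location within window_cm of the true eQTL
    correct = sum(1 for t, pos in calls.items()
                  if pos is not None and truth.get(t) is not None
                  and abs(pos - truth[t]) <= window_cm)
    n_eqtl = sum(1 for pos in truth.values() if pos is not None)
    n_calls = sum(1 for pos in calls.values() if pos is not None)
    nonmapping = [t for t, pos in truth.items() if pos is None]
    # Incorrect calls: nonmapping transcripts that were identified to map
    false_calls = sum(1 for t in nonmapping if calls.get(t) is not None)

    power = correct / n_eqtl if n_eqtl else float("nan")
    fdr = false_calls / n_calls if n_calls else 0.0
    specificity = 1.0 - false_calls / len(nonmapping) if nonmapping else float("nan")
    return power, fdr, specificity
```

With two mapping transcripts at 26 cM (one called at 28 cM, one at 50 cM) and two nonmapping transcripts (one called, one not), the sketch returns a power of 0.5, an FDR of 1/3, and a specificity of 0.5.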

Tests for enrichment:

A number of efforts utilize information from multiple sources to annotate transcripts; and it is informative to identify sets of transcripts that are enriched for some annotation compared with a randomly sampled set of the same size. A hypergeometric calculation is often used to assess evidence of enrichment, but interpretation of resulting P-values is not straightforward due to the many dependent hypotheses tested. Furthermore, the hypergeometric calculation tends to result in small P-values when few transcripts are considered. For these reasons, it has been suggested that one consider only interesting small P-values obtained from a relatively large set of transcripts (>10) (Gentleman 2004). That is the practice we follow here, considering lists of at least size 10 and setting P-value thresholds for enrichment at 0.05. We considered biological functions annotated in the Gene Ontology (GO) database.
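
For reference, the hypergeometric calculation mentioned above can be sketched as an upper-tail probability: given N annotated transcripts overall, K of which carry a GO term, the chance that a random list of size n contains at least k with that term. The function name and argument names are hypothetical, and this is a generic one-sided test rather than the specific software used in the study.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(population N, K with the annotation, sample size n)."""
    upper = min(K, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)
```

A list would then be flagged as enriched under the practice described above when it has at least 10 transcripts and the returned P-value is below 0.05.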

Software:

All calculations were carried out in R (http://www.r-project.org). The IM method was performed using the scanone function with the “imp” option in R/qtl (Broman et al. 2003).

RESULTS

Simulation results:

The results from a single simulation are shown in Figure 1 (results are representative of those observed in the other 19 simulations). The top graphs show results from psMOM and the bottom graphs show those from IM. The left graphs show the average linkage evidence (posterior probabilities in psMOM and LOD posterior probabilities in IM), and the right graphs show transcript-specific HPD regions.

Figure 1.—

Plots of average linkage evidence (left) and transcript-specific linkage evidence (right) from one simulated data set (results are representative of the other 19 simulations). The top left graph shows results from psMOM (and MOM) in dashed red (and solid black) lines; the bottom left graph from IM (and transcript-specific marker regression). MOM (marker regression) refers to psMOM (IM) applied only at genotyped markers. The right graphs show transcript-specific HPD regions (in red) from psMOM (top) and IM (bottom). The true eQTL positions are indicated by “< >” on the x-axes; chromosomes 1–4 are delineated by dotted vertical lines.

From the average linkage evidence shown in the left graphs, the two approaches considered are very similar: they each identify the regions for the single eQTL on chromosome 1 and the two unlinked eQTL on chromosome 3. Both MOM and marker regression miss the two linked eQTL, identifying a wide peak only in the middle region of chromosome 2. The interval-mapping approaches (psMOM and IM) further refined the eQTL underneath this wide peak.

Differences between the approaches are more pronounced when transcript-specific linkage evidence is considered. As shown in the right graphs of Figure 1, psMOM precisely identifies the eQTL correctly for most of the mapping transcripts. In contrast, the regions surrounding the IM identifications are relatively wide. This result could be due to the way the LR normalization was done, to the fact that information shared across transcripts is not accounted for, or to both. To test the former, instead of HPD regions constructed from LOD posterior probabilities, we considered confidence intervals constructed using 1.5-LOD drop intervals around the peak LOD score. As detailed in the Implementation of IM section, this procedure favors IM. Even so, the approach still provided very imprecise estimates of eQTL location, much worse than those shown in Figure 1, and we do not recommend this in practice.

The results for this single simulation hold across simulations, as shown in Tables 1–3. Table 1 reports power at varying thresholds, averaged over the 20 simulated data sets, for each eQTL and each chromosome. At low thresholds, power is similar for the two approaches, with psMOM slightly higher in most cases; at higher thresholds the power of IM drops more sharply (e.g., 0.768 for IM vs. 0.970 for psMOM on chromosome 1 at threshold 0.4). Table 2 shows that FDR from psMOM is well controlled for the three chromosomes, remaining <0.04 at every threshold. On the other hand, the FDR from IM is quite high at the 0.1 cutoff point (0.146–0.173); it decreases with increasing threshold, but with reduced power. Table 3 shows the specificity averaged over the 20 simulated data sets. The specificities are similar for both approaches and they are satisfactory.

Applications to data from a study of diabetes:

The data set considered here is discussed in detail in Lan et al. (2006) and is available at GEO (Barrett et al. 2007), accession no. GSE3330. Briefly, 60 mice (29 males and 31 females) were selected from an F2 population segregating for phenotypes associated with diabetes and obesity (Stoehr et al. 2000). The population was derived from B6 male and BTBR female parents. Selection was based on the selective phenotyping algorithm developed in Jin et al. (2004), which can substantially improve sensitivity for QTL localization compared with random sampling of the same sample size (Jin et al. 2004). The marker map consists of 145 microsatellite markers spanning the 19 mouse autosomes, with an average intermarker distance of 13 cM. Over 90% of the animals are genotyped at any given marker.

Liver total RNA was extracted from frozen tissue samples with RNAzol reagent (Tel-Test). Crude RNA samples were purified with RNeasy mini columns (QIAGEN, Valencia, CA) before hybridization. The RNA samples were processed according to the Affymetrix Expression Analysis technical manual. Expression levels for 45,265 probe sets (referred to hereinafter as transcripts) were measured using the MOE430A and MOE430B chips for each of the 60 F2 mice. Preprocessing and normalization were done using robust multi-array average (RMA) (Irizarry et al. 2003) to obtain a single normalized summary score of expression for each gene in each animal.

Both IM and psMOM were applied to the data; psMOM accommodates F2 populations by increasing the number of expression patterns. For example, with three genotype groups (0, 1, and 2) there are three latent means of interest and four non-null expression patterns for each transcript t at each location Inline graphic.

Posterior probabilities from psMOM and LOD posterior probabilities from IM were averaged across transcripts to identify the genomic regions of most interest. As in the simulation study, the average results from IM and from psMOM largely agreed (Figure 2a). However, when looking at a finer scale, one does observe important differences (Figure 2b). There are two locations in particular (on the distal regions of chromosomes 2 and 5) where psMOM shows some evidence for linkage but IM does not. To test whether the regions identified by psMOM might be meaningful, we consider the biological functions of identified transcripts.

Figure 2.—

Average posterior probabilities from the F2 population. (a) Genomewide view; (b) chromosomes 1, 2, 5, and 8.

We tested for functional enrichment among the transcripts mapping to the two subpeaks (call these trans1a and trans1b) on the distal region of chromosome 2. Both sets show significant enrichment of lipid-metabolism and fatty-acid-metabolism genes (P-values are 0.0016 and 0.0044, respectively). The lipid-metabolism group on the distal region of chromosome 2 coincides with the lipid-metabolism cluster discovered in Lan et al. (2006). Several QTL for obesity and related traits have been mapped to this region (Stoehr et al. 2000). As shown in Figure 2b, there is some evidence of linkage (a single peak) on the distal region of chromosome 2 provided by MOM (the first pass of psMOM at markers only). The transcripts mapping to this region did not show significant enrichment for lipid metabolism or fatty acid metabolism (P-values are 0.2465 and 0.2302, respectively) or for any categories that appeared to be related to our diabetes or obesity phenotypes of interest.

In addition to the chromosome 2 linkages, we considered two linkage peaks on the distal region of chromosome 5, near the marker D5Mit240, as these peaks are identified by psMOM alone. As on chromosome 2, tests here show enrichment for lipid-transport and fatty-acid-metabolism genes. In addition, the enrichment of genes responsible for positive regulation of metabolism is highly significant (P-value = 0.002). A closer look at the mapping list reveals some interesting members. They include PPARα and PPARγ, two major lipid-metabolism transcription factors (Attie and Kendziorski 2003). Other interesting genes include fatty-acid-synthase genes (Fasn, Elovl6, Elovl5, and Fads2), lipid-transport genes (Scp2, Pltp, and Apoa4), and two fatty-acid-metabolism genes (Gpam and CD36). Taken together, these results provide some support for the peaks uniquely identified by psMOM.

DISCUSSION

We have extended the QTL mapping framework of Sen and Churchill (2001) to accommodate expression phenotypes. The Bayesian formulation prescribed by Sen and Churchill (2001) and the pseudomarker-sampling approach developed there are maintained. Our extension relies on specifying a more general form for the genetic model, the model relating phenotype to genotype. By fitting the full genetic model to all phenotypes simultaneously, information can be shared across transcripts through the estimated hyperparameters, which in many cases leads to improved inference.

The extended framework generalizes most eQTL mapping approaches and in doing so facilitates their understanding, evaluation, and precise comparison by revealing their specific characteristics in the context of a common notation, which in turn provides an improved environment for addressing open questions and developing ideas for future methods. As an example, we considered a deficiency of the MOM model, namely that no information is provided between markers. Viewing MOM as a special case of the extended framework clarified how to address this deficiency using pseudomarkers. We expect that other open questions in the area of eQTL mapping can be more readily addressed in the context of this unified framework.

One such question might be the choice of an appropriate threshold. In most eQTL studies to date, thresholds are varied and the one that yields a list with many transcripts while controlling some measure of false positives at a reasonable level is used. A number of false positive measures have been considered; and, clearly, investigators define “many” and “reasonable” quite differently in different contexts (Kendziorski and Wang 2006). The framework presented here can be used to investigate common approaches and perhaps to rigorously address this question.

Without exact knowledge of appropriate thresholds for either psMOM or IM, our evaluations were based on varying thresholds. There appears to be a slight advantage of psMOM over IM, likely due to the information shared across transcripts. For some thresholds, the advantage is negligible; for others, it is much more pronounced. A bigger advantage of psMOM is the precision provided by the HPD regions. Analogous regions were constructed from the LOD profiles, but these provided much less precise localization than the standard HPD regions of psMOM. Indeed, as we noted in the Power and false discovery rate calculation section, the FDRs shown in Table 2 did not consider differentially expressed transcripts that mapped outside the 10-cM window of the eQTL. Considering these transcripts greatly inflated the FDR for IM. Considering alternative methods for LOD profile normalization did not yield results better than those shown here.

Finally, we have focused on representation of approaches for single-eQTL mapping. The simulation results show that the approaches can work well, even for two-eQTL settings, much like single-QTL models can provide information on multiple QTL. Of course, single-QTL models will not work well when multiple QTL are tightly linked; and we here note that extensions of psMOM as presented are possible. In the context of the extended framework, the extension is seen as changes in fPk, where now k > 2. In particular, if a transcript t is affected by two genotype locations l1 and l2, then four latent means are of interest: $\mu_t^{(0,0)}$, $\mu_t^{(0,1)}$, $\mu_t^{(1,0)}$, and $\mu_t^{(1,1)}$. Here $\mu_t^{(g_1,g_2)}$ denotes the latent mean level of expression for transcript t for the populations of animals with genotype (g1, g2) at locations l1 and l2. These latent means can be arranged into 15 possible expression patterns, all of which may be of interest (see supplemental materials at http://www.genetics.org/supplemental/ for pattern specification and further detail). As before, of primary interest is the posterior probability of particular expression patterns. These can be calculated for any pattern of interest.
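The count of 15 patterns corresponds to the number of ways four latent means can be grouped into sets of equal values (the set partitions of four items). The following minimal Python sketch enumerates these groupings. The group labels and the function name `set_partitions` are illustrative assumptions, and the order in which patterns are produced is not meant to match the pattern indexing (P0, P1, …) used in the supplemental materials.

```python
from typing import List

# Labels for the four two-locus genotype groups in a backcross (illustrative).
GROUPS = ["(0,0)", "(0,1)", "(1,0)", "(1,1)"]


def set_partitions(items: List[str]) -> List[List[List[str]]]:
    """Return every way to split `items` into non-empty blocks.

    Each block is read as a set of genotype groups sharing one latent mean.
    """
    if not items:
        return [[]]
    first, rest = items[0], items[1:]
    result = []
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            merged = [first] + partition[i]
            result.append(partition[:i] + [merged] + partition[i + 1:])
        # ... or give it a block of its own.
        result.append([[first]] + partition)
    return result


patterns = set_partitions(GROUPS)

if __name__ == "__main__":
    for k, pattern in enumerate(patterns):
        print(k, " | ".join(" = ".join(block) for block in pattern))
```

The pattern with a single block corresponds to no linkage at either location (all four means equal), and the pattern with four singleton blocks corresponds to four distinct means.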

The approach was applied to the simulated data described previously using all 15 expression patterns (i.e., k = 0, 1, … , 14). Results for one data set (representative of the other 19) are presented in Figure 3. Figure 3a shows the posterior probabilities of P6 (additive model with equal effects) calculated for each marker pair averaged over all the transcripts. Because of symmetry, only the bottom triangle was plotted. The posterior probabilities from single-eQTL psMOM are shown on the diagonal. The two eQTL on chromosomes 2 and 3 are located between markers 5 and 7 and markers 3 and 9, respectively; psMOM identified them with fairly strong evidence.

Figure 3.—

Heat map from the two-dimensional model scan for the simulated data. (a) Average posterior probabilities from the two-eQTL psMOM model; (b) average LOD scores from the 2-QTL IM model. Boxes show the true eQTL positions (a single eQTL at 26 cM on chromosome 1, two eQTL at 44 and 56 cM on chromosome 2, and two eQTL at 22 and 82 cM on chromosome 3). The false identifications by IM are largely due to moderate evidence of linkage combined in the average over simulations (see supplemental data at http://www.genetics.org/supplemental/).

Figure 3b shows the LOD scores derived from a standard two-QTL IM approach. The top triangle contains LOD scores for epistatic interactions; the diagonal shows LOD scores from a single-QTL model; the bottom triangle shows LOD scores for the additive model. Because the simulations did not include an interaction, the top triangle correctly shows very little linkage signal. In the bottom half, however, the entire path between markers 5 and 7 on chromosome 2 and markers 3 and 9 on chromosome 3 is highlighted, with the highest LOD scores occurring at the marker-pair regions between chromosomes 1 and 2 and between chromosomes 2 and 3. In contrast, the graph from the two-eQTL psMOM model gives improved localization of the true eQTL. This is promising evidence for the utility of extending psMOM to multiple eQTL.

One of the main obstacles in the multiple-eQTL model extension is the computational burden. The number of components in the mixture model grows rapidly with the number of loci under investigation. Fitting the full model can be a daunting task. A Dirichlet process mixture model for which the number of components is no longer a bottleneck has been introduced (Chen 2006) and is currently under investigation.

Acknowledgments

The authors thank Gary Churchill, Jessica Flowers, Michael Newton, Saunak Sen, and Ping Wang for useful discussions. This work was supported in part by the National Institute of General Medical Sciences (R01GM076274-01) (G.M.) and the National Institute of Diabetes and Digestive and Kidney Diseases (R01DK066369-03) (C.K.).

APPENDIX A

Assume transcript t is linked to location l. For a backcross, there are then two distinct latent expression means, one for each genotype group, denoted by $\mu_t^{0}$ and $\mu_t^{1}$. Consider a conditional distribution of measurements for animals with genotype 0 given by $y_{t,r}^{0} \mid \mu_t^{0} \sim f_{obs}(\cdot \mid \mu_t^{0})$, $r = 1, \ldots, n$, and a prior distribution on $\mu_t^{0}$ given by $\mu_t^{0} \mid \theta \sim \pi_{\theta}(\mu)$, $t = 1, \ldots, T$. The notation for dependence on l has been suppressed. Under this model, the marginal distribution of measurements $y_t^{0} = (y_{t,1}^{0}, \ldots, y_{t,n}^{0})$ is given by

$$f_{P0}(y_t^{0}) = \int \left( \prod_{r=1}^{n} f_{obs}(y_{t,r}^{0} \mid \mu) \right) \pi(\mu)\, d\mu. \tag{A1}$$

The same form holds for $y_t^{1}$. The marginal distribution of measurements $y_t$ is then given by $f_{P1}(y_t) = f_{P0}(y_t^{0})\, f_{P0}(y_t^{1})$, assuming conditional independence of $y_t^{0}$ and $y_t^{1}$ given the latent expression means $\mu_t^{0}$ and $\mu_t^{1}$.

For calculations presented here, we evaluate expression measurements on the log scale and assume a Gaussian model for $f_{obs}(\cdot)$, with variance $\sigma^2$; $\pi(\cdot)$ is also Gaussian with mean $\mu_0$ and variance $\tau_0^2$. Hence, the hyperparameters shared by all transcripts are $\sigma^2$, $\mu_0$, and $\tau_0^2$. The joint predictive density, $f_{P0}$, is then also Gaussian with mean n vector $(\mu_0, \mu_0, \ldots, \mu_0)$ and exchangeable covariance matrix

$$\sigma^2 I_n + \tau_0^2 M_n,$$

where $I_n$ is an n × n identity matrix and $M_n$ is an n × n matrix of ones. Further detail can be found in Kendziorski et al. (2003).
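Because the covariance matrix is exchangeable, the Gaussian predictive density can be evaluated in closed form without building or inverting an n × n matrix. The following is a minimal Python sketch under the Gaussian assumptions stated above. The function names `log_f_p0` and `log_f_p1`, and the argument names `sigma2` (observation variance), `mu0` (prior mean), and `tau2` (prior variance of the latent mean), are illustrative choices and do not come from the authors' software.

```python
import math
from typing import Sequence


def log_f_p0(y: Sequence[float], mu0: float, sigma2: float, tau2: float) -> float:
    """Log density of N(mu0 * 1, sigma2 * I_n + tau2 * M_n) evaluated at y."""
    n = len(y)
    d = [v - mu0 for v in y]
    sum_sq = sum(v * v for v in d)
    sum_d = sum(d)
    total = sigma2 + n * tau2
    # det(sigma2 * I + tau2 * M) = sigma2^(n-1) * (sigma2 + n * tau2)
    log_det = (n - 1) * math.log(sigma2) + math.log(total)
    # (sigma2 * I + tau2 * M)^(-1) = (I - (tau2 / total) * M) / sigma2
    quad = (sum_sq - (tau2 / total) * sum_d ** 2) / sigma2
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


def log_f_p1(y0: Sequence[float], y1: Sequence[float],
             mu0: float, sigma2: float, tau2: float) -> float:
    """Log marginal density when the two genotype groups have distinct means."""
    return log_f_p0(y0, mu0, sigma2, tau2) + log_f_p0(y1, mu0, sigma2, tau2)


if __name__ == "__main__":
    # Hypothetical log-scale measurements for one transcript, split by genotype.
    y0 = [7.9, 8.1, 8.0]
    y1 = [9.0, 9.2, 8.9]
    pooled = y0 + y1
    # Positive values favor distinct means over a single shared mean.
    log_ratio = (log_f_p1(y0, y1, 8.5, 0.05, 1.0)
                 - log_f_p0(pooled, 8.5, 0.05, 1.0))
    print(round(log_ratio, 3))
```

The determinant and inverse used in the comments follow from the rank-one structure of the matrix of ones; in practice the hyperparameters would be estimated from all transcripts rather than fixed by hand as in the usage example.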

APPENDIX B

The posterior distribution of the QTL location for transcript t is given by

$$p(\gamma_t = l \mid y, m) = \frac{p(y, m \mid \gamma_t = l)\, p(\gamma_t = l)}{p(y, m)}. \tag{B1}$$

In detail,

graphic file with name M57.gif

where $y_{-t}$ denotes the matrix of expression phenotypes with transcript t omitted. We assume here (second equality) that the distribution of the marker genotypes is independent of the eQTL location and latent expression means (this is analogous to the assumption made in Sen and Churchill (2001) in their Appendix A in justifying their final equality, with latent expression means here corresponding to their model parameters). We further assume that $y_t$ is conditionally independent of $y_{-t}$ given the latent expression mean for the tth transcript, $\mu_t$, and that the latent expression means are independent across transcripts (third equality). A similar derivation gives $p(y, m) = p(y_t \mid m)\, p(y_{-t} \mid m)\, p(m)$.

Substituting these quantities into (B1), we have

$$p(\gamma_t = l \mid y, m) = \frac{p(y_t \mid m, \gamma_t = l)\, p(\gamma_t = l)}{p(y_t \mid m)}. \tag{B2}$$

Note that p(yt | m, γt = l) in the numerator of (B2) can be further written as

graphic file with name M59.gif

Here, g represents the eQTL genotype of the tth transcript. We once again assume that the distribution of the marker genotypes is independent of the eQTL location (second equality). The third equality follows from the assumption that the expression levels of the tth transcript are independent of eQTL locations and marker genotypes given the eQTL genotype and latent expression means (this is similar to the second equality of Sen and Churchill 2001, their Appendix A, with latent expression means corresponding to their model parameters) and that the eQTL genotype is independent of the latent expression means for transcript t given the eQTL locations and markers (similar to Sen and Churchill 2001, their Appendix A, third equality). The last equality is given by the definition of fP1.

In summary, we have

graphic file with name M60.gif (B3)

for Inline graphic.

The notation πInline graphic implies that integration with respect to Inline graphic is a two-dimensional integral over the joint distribution of the latent means in the two genotype conditions.
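As a numerical illustration of the normalization behind (B2), the sketch below computes a discrete posterior over candidate locations from per-location log likelihoods and a prior, and then extracts a highest-posterior-density (HPD) set. It assumes that the locations form a finite grid and that the transcript is linked to exactly one of them, so the denominator is the sum over the grid; the nonlinkage case is ignored. The function names `location_posterior` and `hpd_region` are hypothetical and not taken from the authors' implementation.

```python
import math
from typing import List, Sequence


def location_posterior(log_lik: Sequence[float], prior: Sequence[float]) -> List[float]:
    """Normalize prior[l] * exp(log_lik[l]) over candidate locations l.

    Every prior entry must be strictly positive.
    """
    log_post = [ll + math.log(p) for ll, p in zip(log_lik, prior)]
    shift = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(v - shift) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def hpd_region(posterior: Sequence[float], mass: float = 0.95) -> List[int]:
    """Smallest set of location indices whose posterior mass reaches `mass`."""
    order = sorted(range(len(posterior)), key=lambda i: posterior[i], reverse=True)
    chosen: List[int] = []
    running = 0.0
    for i in order:
        chosen.append(i)
        running += posterior[i]
        if running >= mass:
            break
    return sorted(chosen)


if __name__ == "__main__":
    # Hypothetical log likelihoods at five pseudomarker positions.
    log_lik = [-12.0, -10.5, -8.0, -8.4, -11.0]
    prior = [0.2] * 5
    post = location_posterior(log_lik, prior)
    print([round(p, 3) for p in post], hpd_region(post, 0.9))
```

The indices returned by `hpd_region` need not be contiguous; reporting them as a genomic interval would require an additional convention.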

References

  1. Attie, A., and C. Kendziorski, 2003. Pcg-1alpha at the crossroads of type 2 diabetes. Nat. Genet. 34: 244–245.
  2. Barrett, T., D. Troup, S. Wilhite, P. Ledoux, D. Rudnev et al., 2007. NCBI GEO: mining tens of millions of expression profiles–database and tools update. Nucleic Acids Res. 33: D562–D566.
  3. Brem, R., G. Yvert, R. Clinton and L. Kruglyak, 2002. Genetic dissection of transcriptional regulation in budding yeast. Science 296: 752–755.
  4. Broman, K., H. Wu, S. Sen and G. Churchill, 2003. R/qtl: QTL mapping in experimental crosses. Bioinformatics 19: 889–890.
  5. Chen, M., 2006. Statistical methods for expression quantitative trait loci (eQTL) mapping. Ph.D. Thesis, University of Wisconsin, Madison, WI.
  6. Cox, N., 2004. An expression of interest. Nature 430: 733–734.
  7. Cui, X., G. Hwang, J. Qiu, N. Blades and G. Churchill, 2005. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics 6: 59–75.
  8. Dupuis, J., and D. Siegmund, 1999. Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 151: 373–386.
  9. Gelfond, J., J. Ibrahim and F. Zou, 2006. Proximity model for expression quantitative trait loci (eQTL) detection. Biometrics 62(1): 19–27.
  10. Gentleman, R., 2004. Using GO for statistical analyses, pp. 171–180 in Proceedings of COMPSTAT 2004 Symposium, Prague.
  11. Irizarry, R., B. Hobbs, F. Collin, Y. Beazer-Barclay, K. Antonellis et al., 2003. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4: 249–264.
  12. Jansen, R., 1993. A general mixture model for mapping quantitative trait loci by using molecular markers. Theor. Appl. Genet. 85: 252–260.
  13. Jansen, R., and P. Stam, 1994. High resolution of quantitative traits into multiple loci via interval mapping. Genetics 136: 1447–1455.
  14. Jin, C., H. Lan, A. Attie, D. Bulutuglo, G. Churchill et al., 2004. Selective phenotyping for increased efficiency in genetic mapping studies. Genetics 168: 2285–2293.
  15. Kendziorski, C., M. Chen, M. Yuan, H. Lan and A. Attie, 2006. Statistical methods for expression quantitative trait loci (eQTL) mapping. Biometrics 62: 19–27.
  16. Kendziorski, C., M. Newton, H. Lan and M. Gould, 2003. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat. Med. 22: 3899–3914.
  17. Kendziorski, C., and P. Wang, 2006. A review of statistical methods for expression quantitative trait loci mapping. Mamm. Genome 17: 509–517.
  18. Lan, H., M. Chen, J. Byers, B. Yandell, D. Stapleton et al., 2006. Combined expression trait correlations and expression quantitative trait locus mapping. PLoS Genet. 2: 0051–0061.
  19. Lander, E., and D. Botstein, 1989. Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185–199.
  20. Mangin, B., B. Goffinet and A. Rebai, 1994. Constructing confidence intervals for QTL location. Genetics 138: 1301–1308.
  21. Newton, M., C. Kendziorski, C. Richmond, F. Blattner and K. Tsui, 2001. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J. Comput. Biol. 8: 37–52.
  22. Newton, M., A. Noueiry, D. Sarkar and P. Ahlquist, 2004. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5: 155–176.
  23. Satagopan, J., B. Yandell, M. Newton and T. Osborn, 1996. A Bayesian approach to detect quantitative trait loci using Markov chain Monte Carlo. Genetics 144: 805–816.
  24. Schadt, E., S. Monks, T. Drake, A. Lusis, N. Che et al., 2003. Genetics of gene expression surveyed in maize, mouse and man. Nature 422: 297–302.
  25. Sen, S., and G. Churchill, 2001. A statistical framework for quantitative trait mapping. Genetics 159: 371–387.
  26. Smyth, G., 2004. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3: 1–27.
  27. Stoehr, J., S. Nadler, K. Schueler, M. Rabaglia, B. Yandell et al., 2000. Genetic obesity unmasks nonlinear interactions between murine type 2 diabetes susceptibility loci. Diabetes 49: 1946–1954.
  28. Tusher, V., R. Tibshirani and G. Chu, 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 98: 5116–5121.
  29. Visscher, P., R. Thompson and C. S. Haley, 1996. Confidence intervals in QTL mapping by bootstrapping. Genetics 143: 1013–1020.
  30. Yi, N., 2004. A unified Markov chain Monte Carlo framework for mapping multiple quantitative trait loci. Genetics 167: 967–975.
  31. Yi, N., and S. Xu, 2000. Bayesian mapping of quantitative trait loci for complex binary traits. Genetics 155: 1391–1403.
  32. Yvert, G., R. Brem, J. Whittle, J. Akey, E. Foss et al., 2003. Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nat. Genet. 35: 57–64.
  33. Zeng, Z., 1993. Theoretical basis of separation of multiple linked gene effects on mapping quantitative trait loci. Proc. Natl. Acad. Sci. USA 90: 10972–10976.
  34. Zeng, Z.-B., 1994. Precision of mapping of quantitative trait loci. Genetics 136: 1457–1468.

Articles from Genetics are provided here courtesy of Oxford University Press
