Skip to main content
Genetics logoLink to Genetics
. 2007 Aug;176(4):2405–2420. doi: 10.1534/genetics.107.071779

A Covariance Structure Model for the Admixture of Binary Genetic Variation

Mark N Grote 1,1
PMCID: PMC1950641  PMID: 17565951

Abstract

I derive a covariance structure model for pairwise linkage disequilibrium (LD) between binary markers in a recently admixed population and use a generalized least-squares method to fit the model to two different data sets. Both linked and unlinked marker pairs are incorporated in the model. Under the model, a pairwise LD matrix is decomposed into two component matrices, one containing LD attributable to admixture, and another containing, in an aggregate form, LD specific to the populations forming the mixture. I use population genetics theory to show that the latter matrix has block-diagonal structure. For the data sets considered here, I show that the number of source populations can be determined by statistical inference on the canonical correlations of the sample LD matrix.


ADMIXTURE, the mixing of genetically differentiated populations via migration and subsequent intermating, can create linkage disequilibrium (LD) between genes, even when the genes are not physically linked (see, e.g., Cavalli-Sforza and Bodmer 1971, p. 69; Prout 1973). In this work, I show that admixture contributions to LD can be statistically quantified and distinguished from LD attributable to the shared ancestry of linked alleles. Ohta (1982) used Wright's island model to decompose a squared coefficient of LD into within- and between-population terms, in analogy with Wright's (1940) decomposition of the inbreeding coefficient. These decompositions assume that populations connected by migration can be identified and sampled for genetic variation. The method I propose uses a pairwise LD matrix sampled from an admixed population of unknown composition; the number of source populations and the components of LD are inferred by use of a multivariate statistical model.

The data: blocks of binary markers:

It is convenient to develop the model using gametes as the basic units of observation. The data are then n random binary vectors of the form x = (x1,…xL)′, with xl ∈ {0, 1}; l = 1,…, L. Each vector represents the single nucleotide polymorphism (SNP) variation on one sampled gamete, under an arbitrary binary coding scheme, with xl indicating the allele on gamete x at the lth marker locus. The markers are assumed to be selectively neutral and variable in the sample.

Before statistical analysis begins, the markers in x are to be grouped into blocks by the investigator, based on physical criteria (e.g., the markers in a block share a localized region on a physical map), along with empirical evidence (e.g., the markers in a block are known by a previous linkage-mapping study to form a linkage group) independent of the sample under consideration. Each marker belongs to exactly one block; however, a particular marker may be the only member of a block. Any two markers l, m within the same block are assumed to be linked, with recombination fraction Inline graphic. In contrast, markers l, j from different blocks are assumed to be unlinked, with recombination fraction cljInline graphic.

In the development below, block structure derives from physical and linkage relationships between markers, with blocks analogous to linkage groups; this is to be distinguished from empirical descriptions of “haplotype block structure” (see, e.g., Gabriel et al. 2002; Phillips et al. 2003). Recent work of the International HapMap Consortium (2005) and Myers et al. (2005) suggests that haplotype block structure results from variation in recombination rates over small physical distances. The fine-scale rate estimates obtained by Myers et al. (2005) demonstrate new tools for constructing blocks of tightly linked markers. In the data examples to follow, I form two blocks of markers on the arms of the human X chromosome, using simple physical criteria.

To fix notation, I assume that x can be partitioned into B blocks, labeled 1,…, B in any convenient order (though the same partitioning and labeling scheme is used for all gametes). For b = 1,…B, the bth block contains Lb ≥ 1 marker loci, with Inline graphic. I order the vector elements x1,…, xL so that the indices Inline graphic are assigned to the loci of the bth block. However, markers within a block need not be ordered in any particular way.

For binary loci, linkage disequilibrium is covariance:

For binary loci, a familiar pairwise LD parameter (Lewontin and Kojima 1960) is a covariance between locus-specific indicator variables. Let pl be the population frequency of the “1” allele at locus l, let plm be the population frequency of gametes carrying the “1” allele at both loci l and m, and let E be the expectation operator for the distribution on (xl, xm) ∈ {0, 1} × {0, 1} parametrized by pl, pm, and plm. The LD parameter for loci l and m is then

graphic file with name M5.gif (1)

For consistent definitions, I take Dll to be the variance of the locus-specific indicator:

graphic file with name M6.gif (2)

Given a random sample of binary gametes x1 , … , xn, with Inline graphic, I treat the l, m entry of the sample covariance matrix

graphic file with name M8.gif (3)

as an estimate of Dlm. I use the denominator n in (3), rather than the denominator n − 1 typically used for unbiased estimation, for consistency with sample allele frequency calculations. The components of linkage disequilibrium under the admixture model are determined by the structure of S and the sample allele frequencies. Other sample-based estimates of the population covariance matrix may provide suitable alternatives to S; the composite LD estimates described by Cockerham and Weir (1977; see also Weir 1979) are particularly attractive, as they do not require direct observation of gametes. A complete description of the associations among L binary markers requires the use of third- and higher-order moments, along with the second-order moments considered here (see, e.g., Ekholm et al. 1995). Nonetheless, relevant structure in the data induced by admixture can be detected by an analysis limited to covariances.

In a recent article, Patterson et al. (2006) also use a sample covariance matrix to make inferences about admixture. Although the locus-specific variables defined by Patterson et al. refer to diploid genotypes and take values in {0, 1, 2}, it is useful to establish a relationship between the matrix they analyze and S. In the notation above (up to differences in variable definitions, and omitting a normalization step), the n × n matrix of inter-individual covariances analyzed by Patterson et al. (2006) could be written as

graphic file with name M9.gif (4)

Thus the implicit expectation is over loci rather than over individuals in the population, and covariances are between multilocus genotypes rather than between locus-specific indicators. Wickens (1995) uses the terms variable space and subject space to distinguish the multivariate spaces that give rise to S and (4), respectively. Gower (1966) and Jolliffe (2002, Sect. 5.2) compare and contrast the two approaches, although a comparison in the context of admixture inference is beyond the scope of this work.

THE PULSE-DECAY MODEL OF ADMIXTURE AND A MATRIX DECOMPOSITION

The “pulse-decay” model (shown schematically in Figure 1) is a highly simplified admixture model that, somewhat unexpectedly, shares mathematical properties with traditional covariance structure models. It is a variation of the “immediate admixture” model of Ewens and Spielman (1995), emphasizing different statistical properties. I give a detailed derivation to show that the statistical model used later for data analysis is implied by the pulse-decay model.

Figure 1.—

Figure 1.—

Schematic of the pulse-decay model of admixture, showing representative population parameters. Diagonal arrows indicate the transfer of randomly chosen genotypes, contributed to the admixed population in the proportions q(1),…, q(K). Vertical arrows indicate transitions of the Wright–Fisher process in the admixed population. 2N(k) is the number of gametes present in source population k at generation 0, pl(k) and pm(k) are the frequencies of the “1” allele at loci l and m, and Dlm(k) is the “standing” LD between loci l and m in source population k. Further details of the model are given in the text.

Under the pulse-decay model, generation 1 of the admixed population consists of N diploid genotypes sampled at random with replacement from K ≥ 2 source populations. The probability that a genotype is chosen from the kth source population is q(k) > 0; k = 1,…, K, with Inline graphic. The mixing proportions q(1),… q(K) are unknown. In practice, K may be an unknown parameter requiring estimation. A genotype contributed by the kth source population consists of two gametes chosen at random with replacement from the 2N(k) gametes present at generation zero in the kth source population. At generation zero, the frequency of the “1” allele at locus l in source population k is pl(k), k = 1,…, K, and the covariance between loci l and m in source population k is Dlm(k). I call the vector of allele frequencies Inline graphic′ from source population k p(k), and I assume that the source population allele frequency vectors p(1),…, p(K) are linearly independent. The latter assumption implies that the number of markers L is at least as large as the number of source populations K. Linear independence of p(1),…, p(K) also implies that no source population is an admixture of the other source populations. I describe the effect of weakening the assumption of linear independence briefly in a later section. Under the pulse-decay model, the source populations make no further contributions to the sampled population after generation 1. Mating is at random in the admixed population from generation 1 onward.

By straightforward calculation, the expected frequency of the “1” allele at locus l, at generation 1 in the admixed population, is Inline graphic, where the expectation is over samples drawn from the source populations. Similarly, the covariance between loci l, m at generation one in the admixed population is

graphic file with name M13.gif (5)

where Dll(k) is the locus-specific variance in the kth source population.

The expected covariance in generation t + 1, over realizations of the Wright–Fisher process in the admixed population, is

graphic file with name M14.gif (6)

where clm = cml ∈ (0, Inline graphic] is the recombination fraction between loci l and m (see, e.g., Karlin and McGregor 1968, Equation 7). Weir and Cockerham (1974) provide related expressions for “genotypic” disequilibria (see Weir 1996, Chap. 3), which completely specify two-locus dynamics in diploid organisms. Whereas D1,lm of Equation 5 is a population parameter, 𝒟t+1,lm of Equation 6 is a random variable. In Equation 6, the value clm = 0, corresponding to complete linkage of l and m, is reserved for the variance 𝒟ll, where it gives the formula for the decay of heterozygosity in a finite population (see e.g., Crow and Kimura 1970, Section 3.11).

Examining Equation 6 with statistical modeling in mind, it is attractive to write the realized covariance in generation t + 1 as

graphic file with name M16.gif (7)

where ρt,lm = ρt,ml is an unknown parameter giving the proportion of initial covariance (or variance) remaining after t transitions of the Wright–Fisher process. A sample from the admixed population in generation t + 1 produces an estimate of Dt+1,lm, via the l, m element of the sample covariance matrix. Although ρt,lm is an important structural component of the model, it will not be an object of estimation. To establish later mathematical results, I assume that there exist ε and θ such that

graphic file with name M17.gif (8)

This assumption places uniform upper and lower bounds on the extent to which covariances (or variances) have decayed since the admixture “pulse,” for the loci included in the study. Covariances between unlinked loci (with clm = Inline graphic) are expected to decay much more quickly than covariances between linked loci: for unlinked loci, ρt,lm is on the order of 2t. In light of this observation, (8) significantly limits the number of generations that could have occurred since the admixture pulse. Assumption (8) also rules out changes in the signs of covariances, up to and including generation t + 1. The pulse-decay model is thus revealed to be primarily suited for recent admixture. Accordingly, any additional changes in allele frequencies and covariances due to mutation are assumed to be small enough to ignore, within the time-scale of the model.

Under the pulse-decay model, the realized covariance matrix in the admixed population at generation t + 1 can be written as

graphic file with name M19.gif (9)

where RInline graphic is the symmetric matrix having l, m element ρt,lm, ⊙ is the Hadamard (elementwise) product,

graphic file with name M21.gif
graphic file with name M22.gif

Inline graphic, and Inline graphic is the symmetric matrix having l, m element Inline graphic. The columns of Γ are weighted vectors of source-population allele frequency deviations. The number of generations elapsed since the admixture pulse is assumed to be unknown. I do not propose to explicitly model time-dependent properties, but rather to model the effect of recent admixture on observed covariances. Hence, the dependence on t is dropped in the final expression of (9).

The properties of Inline graphic and Ψ:

The theory of population genetics predicts that covariances between unlinked loci will be negligible in the source populations, if the source populations are not very small. Here, I address the consequences of this prediction for Ψ, along with other properties of Inline graphic and Ψ.

In terms of understanding and specifying the model, much is gained by making clear how model parameters are derived from stochastically varying quantities. To describe the association between alleles at loci l and m in the kth source population, I imagine a 2 × 2 table, formed from the set of gamete counts

graphic file with name M28.gif (10)

The counts are arranged in the table in some natural way, e.g., by assigning rows to the alleles at locus l and columns to the alleles at locus m. The following discussion parallels Weir (1996, pp. 112–114), with the important exception that here the entire kth source population at generation zero plays the role of the “sample” in Weir's development. Under the Wright–Fisher model, the gamete counts in (10) are random variables, having a multinomial distribution with parameters 2N(k) and Inline graphic; I specify the probabilities Inline graphic in more detail below for the case of unlinked loci. I view the l, m-locus gamete counts in the kth source population at generation zero as a particular realization of (10), and the allele frequencies pl(k), pm(k) appearing in Figure 1 and Equation 5 as marginal frequencies of the realized 2 × 2 table. The covariance between xl and xm in the kth source population, written as a function of the random gamete counts, is

graphic file with name M31.gif (11)

I view the parameter Dlm(k), appearing in Figure 1 and on the right-hand side of Equation 5, as a realization of 𝒟lm(k).

For unlinked loci l, m and 2N(k) sufficiently large, gamete frequencies follow the “independent loci” model (Ethier 1979); under this model, the probability distribution on 2 × 2 tables has the property of row-by-column independence. Let πl(k) (respectively, πm(k)) be the probability that a randomly chosen gamete of the kth source population carries the “1” allele at locus l (m). Under the independent loci model, the multinomial probabilities Inline graphic are π11(k) = πl(km(k), Inline graphic, Inline graphic, and Inline graphic. It is worth noting that the Wright–Fisher process in the kth source population need not be stationary in order for unlinked l, m to follow the independent loci model (Ethier 1979). Further, for large source populations, even weakly linked loci (those with clm < Inline graphic, but clm not very small) can be treated as independent (Ethier 1979).

By straightforward calculation, under the independent loci model, E(𝒟lm(k)) = 0, where the expectation is over multinomial samples forming the kth source population at generation zero. For 2N(k) sufficiently large, a calculation using the “delta method” gives

graphic file with name M37.gif (12)

(see Weir 1996, Equations 2.17, 3.10), again assuming the independent loci model. As a consequence of the above, when l, m are unlinked and 2N(k) is sufficiently large, the parameter Dlm(k) in Equation 5 can be approximated by a normal deviate having mean zero and variance Q(k)/2N(k), where 0 < Q(k) < 1. Provided that all of the source populations are sufficiently large, the parameters Dlm(1), … Dlm(K) for unlinked l, m may then be viewed as a set of small, independent fluctuations about zero. I assume that the weighted average of these fluctuations, Inline graphic, can be equated to zero for unlinked l, m. The fact that unlinked loci l, m are assigned to different blocks leads to the observation that

graphic file with name M39.gif

Consequently, Ψ is of block-diagonal form:

graphic file with name M40.gif (13)

where Ψb is of dimensions Lb × Lb; b = 1,…, B.

The sub-matrices Ψb, formed by locus pairs l, m in the same block, contain covariances attributable to the shared ancestry of linked loci. Calculations of Ohta and Kimura (1969) suggest the approximation Inline graphic, for linked loci with recombination fraction clm, where Inline graphic. This variance characterizes the order of magnitude of Dlm(1),…Dlm(K) when l, m are linked. For Inline graphic, Var(𝒟lm(k)) is nonnegligible, compared to the scale of a sample covariance slm. In the pulse-decay model, the Dlm(k) for linked l, m are averaged over source populations in order to form the elements of Ψ1,…, ΨB. In contrast to the elements of the off-diagonal-blocks, I do not equate the elements of Ψ1,…, ΨB, to zero; yet, some elements of Ψ1,…, ΨB may indeed be close to zero, if the covariances Dlm(k) for linked loci tend to be of opposite signs in different source populations, and if the weights q(1),…, q(K) are nearly equal. A scenario in which one source population has a relatively large mixing proportion q leads to a straightforward interpretation of the elements of Ψ1,…, ΨB: here, the element ψlm mainly reflects the covariance between l, m in the dominant source population.

The matrix Inline graphic contains components of covariance resulting only from differences in allele frequencies between source populations. These components are independent of any notion of physical linkage between loci, and could be considered entirely artifacts of admixture.

Inline graphic and Ψ are symmetric by definition, and the following proposition establishes that both have the other essential property of covariance matrices. A proof is given in the appendix.

Proposition 1. Inline graphic and Ψ are nonnegative definite.

In the remaining sections, I write M ≥ 0 to indicate that a symmetric matrix M is nonnegative definite, and M > 0 to indicate that M is positive definite.

The multiple battery factor analysis model:

The matrix Inline graphic containing admixture covariances has a factorization given by the following proposition; the covariance structure model is then an immediate consequence. A proof is given in the appendix.

Proposition 2. There exists a matrix ΛL×(K–1) of full column rank such that Inline graphic.

Proposition 2 establishes that the rank of Λ is one less than the number of linearly independent source population vectors p(1),…, p(K). Linear dependencies among p(1),…, p(K), which could arise if the source populations are related by recent descent from an ancestral population, or are themselves connected by migration, will reduce the rank of Λ accordingly.

The matrix equation

graphic file with name M49.gif (14)

with Ψ block-diagonal, is a covariance structure model associated with the “inter-battery” (for B = 2 blocks, Tucker 1958) and “multiple battery” (for B ≥ 3, Browne 1980, Browne and Tateneni 1997) factor analysis models of psychometrics. The preceding development has shown that the covariance structure Equation 14 is implied by the pulse-decay model. In psychometrics, (14) is the covariance structure implied by the linear latent variable model

graphic file with name M50.gif (15)

where μL×1 and ΛL×(K–1) are unknown parameter matrices, fInline graphicK−1 is a vector of unobserved “inter-battery” factors with E(f) = 0 and Var(f) = I, and zInline graphicL is a vector of unobserved “battery-specific” factors, independent of f, with E(z) = 0 and Var(z) = Ψ (see Browne 1980). In ordinary factor analysis, Ψ is a diagonal matrix containing the “unique variances” of the elements of x (Lawley and Maxwell 1971). The present statistical problem is to estimate Λ and Ψ given an observation S of Σ.

Psychometric studies of “difficulty factors” (see, e.g., McDonald and Ahlawat 1974) caution against the naïve application of models (14) and (15) to binary data. However, alternative models for binary observations, including latent class and response function models (see e.g., Lazarsfeld and Henry 1968; Bartholomew and Knott 1999; Satten et al. 2001), assume that all pairs of observed variables are conditionally independent given the latent variables, and thus do not allow for the within-block covariances encoded in Ψ. The block-diagonal structure of Ψ is an essential aspect of both the genetic model and the statistical methods described here.

ESTIMATION OF MODEL PARAMETERS

I first describe the basic model-fitting problem for an arbitrary number of blocks B; then in subsequent sections I discuss the cases B = 2 and B ≥ 3, and give two data examples.

For Λ with a given column rank K – 1, the parameters of the model (14) are said to be identified if the map from (Λ, Ψ) to Σ is one-to-one, i.e.,, if

graphic file with name M53.gif

(Jöreskog 1981). In practical terms, the identifiability of a model (or lack thereof) determines whether or not unique estimates of model parameters can be found, given the sample. When parameters of (14) are not identified, the strategy I adopt below is to obtain estimates which satisfy a generalized least-squares criterion. In particular situations I will describe, the parametrization Σ = Inline graphic is partly or wholly identified.

We know from population genetics that sample allele frequencies impose constraints on observed LD values, so it is natural to ensure that the same constraints apply to model estimates. As the diagonal element sInline graphic is the sample variance of a binary variable, clearly

graphic file with name M56.gif (16)

For distinct loci l, m, let Inline graphic and Inline graphic, where Inline graphic, Inline graphic are the sample frequencies of the “1” allele at loci l and m. The inequality constraints

graphic file with name M61.gif (17)

(Lewontin 1964) derive from the fact that two-locus gamete frequencies Inline graphic are nonnegative. Thomson and Baur (1984) described an additional set of inequality constraints on the elements of S, deriving from the fact that three-locus gamete frequencies Inline graphic are nonnegative. For the triplet of distinct loci l, m, j, the constraints can be written as

graphic file with name M64.gif (18)
graphic file with name M65.gif (19)
graphic file with name M66.gif (20)
graphic file with name M67.gif (21)

Robinson et al. (1991) determined that no additional constraints on the elements of S are imposed by considering loci in groups of four or more; thus, a complete set of inequality constraints can be obtained by applying (16) to the diagonal elements of S, (17) to the elements involving distinct locus pairs l, m, and (18)–(21) to the elements involving distinct locus triplets l, m, j. I will say that model estimates Inline graphic and Inline graphic satisfy the constraints 𝒞 if (16)–(21) hold, upon replacing sample quantities slm by their estimates Inline graphic, for l, m ∈ {1, … L}.

I use generalized least-squares (GLS) methods for fitting the model (14) to data for two main reasons: these methods do not require explicit distributional assumptions, and the statistical properties of GLS estimates are known under fairly general conditions (see, e.g., Browne 1974, 1984; Jöreskog 1981). For a “weight” matrix VL×L > 0 yet to be chosen, GLS estimates of Λ and Ψ are obtained by minimizing the discrepancy function

graphic file with name M71.gif (22)

on the set

graphic file with name M72.gif

where ‖ · ‖ is the Euclidean norm. I also examine the unconstrained problem, in which (22) is minimized on {(Λ, Ψ): rank(Λ) = K – 1, Ψ0}; for, a minimizer Inline graphic of the unconstrained problem may, for a given data set, satisfy the constraints 𝒞. In such a situation, Inline graphic is also the minimizer of the constrained problem. When V is not a function of Λ or Ψ, straightforward matrix differentiation (see, e.g., Magnus and Neudecker 1988, section 9.9) gives the estimating equations

graphic file with name M75.gif (23)

and

graphic file with name M76.gif (24)

where for ML×L, Bdiag[M] gives M the same block-diagonal structure as Ψ. Equations 23 and 24 give necessary conditions for a local (or global) minimum of fV, at a differentiable point of the parameter space (Magnus and Neudecker 1988, Sect. 7.4).

In the following sections, I use weight matrices V > 0 having the same block-diagonal structure as Ψ. There are two main reasons for this choice: first, block-diagonal V can be chosen to reflect unique features (levels of polymorphism, etc.) of particular blocks; second, block-diagonal V leads to a simple estimating equation for Ψ. For block-diagonal V, pre- and post-multiplication by V−1, along with straightforward matrix algebra, establishes that the necessary condition (24) is equivalent to

graphic file with name M77.gif (25)

Consequently, model estimates within diagonal blocks equal their observed counterparts:

graphic file with name M78.gif (26)

THE MODEL WITH B = 2 BLOCKS

For B = 2, it is convenient to write the two blocks of a gamete vector as Inline graphic and Inline graphic, with Lx + Ly = L. The covariance structure model (14) is then

graphic file with name M81.gif (27)

where Λx, Λy are respectively of dimensions Lx × (K – 1), Ly × (K – 1). I assume without loss of generality that LxLy.

Unfortunately, Λ and Ψ are essentially unidentified for the model with B = 2 blocks. Given Inline graphic, Inline graphic, Inline graphic, and Inline graphic, which solve (27), many new solutions can be generated by the family of transformations

graphic file with name M86.gif

where M is any conformable, invertible matrix. All of these solutions will satisfy (27) for a given Σ; thus we need to impose additional criteria to choose a “good” solution from the many. Although Λ and Ψ are not identified, the off-diagonal blocks Inline graphic, under the parametrization Inline graphic, are identified as

graphic file with name M89.gif

This parametrization is less informative mathematically, but shows that matrix elements describing admixture LD between loci of different blocks are unique.

I advocate a solution of model (27) first described by Rao (1973, 1979), and here give a rationale for its use in the admixture decomposition. I assume in the following that Sxx and Syy are nonsingular and that

graphic file with name M90.gif

The latter is a technical assumption which ensures that each latent dimension of the admixture model is represented in the sample. The singular value decomposition (SVD) of Inline graphic is accordingly

graphic file with name M92.gif (28)

where Inline graphic and Inline graphic are orthogonal, Inline graphic,

graphic file with name M96.gif

and r1 ≥ ··· ≥ rd are the dLx singular values.

The Rao (1979) solution for the model with B = 2 blocks is

graphic file with name M97.gif (29)

where Δ(K–1) = diag(r1, … rK–1), and U(K–1), W(K–1) respectively contain the columns of U, W corresponding to the ordered singular values in Δ(K–1). The singular values r1 ≥ ··· ≥ rd of SInline graphicSxySInline graphic, all of which lie in [0, 1], are the sample canonical correlations between the variables x and y.

The Rao solution can be connected to the unconstrained GLS problem by a particular choice of V in (22), and by use of the corresponding necessary condition ∂fV/∂Ψ = 0. A proof of the proposition is given in the appendix.

Proposition 3. For Inline graphic, the Rao solution (29) gives a global minimum of fV(Λ, Ψ; S) on the set {(Λ, Ψ): rank (Λ) = K – 1, Ψ ≥ 0}.

Thus the Rao solution is the GLS solution for the unconstrained problem, given the choice of V immediately above. Browne (1979) shows that the Rao solution (29) gives maximum-likelihood estimates of (Λ, Ψ) in the multivariate normal setting, under additional identifiability conditions. The connection given by Proposition 3 between the Rao solution and the minimization of the GLS discrepancy function is apparently new.

There is no guarantee that the Rao solution (29) will satisfy the constraints 𝒞 imposed by a given data set. Numerical methods for minimization of the discrepancy function (22), which explicitly incorporate inequality constraints, have been described by Lee (1980); these remain an option if the fit of the Rao solution is unsatisfactory.

Data from The SNP Consortium:

This is the first of two data examples using SNPs organized into two blocks on the human X chromosome. My aim is more to show a proof of principle rather than to demonstrate any elaborations of the analysis. Although the examples may seem rather narrow, I do not intend that the method be limited only to very similar situations. In both examples I use only X chromosomes typed in males, in order to avoid any complications introduced by either inferring phase or estimating pairwise LD by indirect methods. Both data sets are, in a broad sense, deliberately admixed, as they combine population samples meant to characterize human ethnic or geographic diversity. However, I will not assume that ethnic or geographic labels necessarily correspond to source populations, even if evidence for admixture is found.

The first data set is a sample of n = 110 independent X chromosomes, typed by Celera Genomics as part of The SNP Consortium (TSC) Linkage Map Project (Clark et al. 2003; Matise et al. 2003). The n = 110 males are comprised of the following (presumably self-identified) groups: 13 Asian males from the Celera diversity panel of 30 unrelated Asians; 15 African-American males from the Celera diversity panel of 30 unrelated African Americans; and 82 Caucasian males from the Utah CEPH pedigrees typed in “phase 2” of the Linkage Map Project. I selected CEPH males who could provide independent X chromosomes: typically for a three-generation pedigree, I included the father's and paternal grandfather's X chromosomes, along with a single X chromosome chosen at random from the maternal grandfather and sons.

The 11 SNPs used in this example are a subset of 39 X-linked SNPs typed by Celera Genomics in the diversity panels and CEPH Utah pedigrees. Two broad requirements guided the selection of these SNPs. First, the SNPs can be assigned to two blocks located on different arms of the X chromosome (see Figure 2 for the physical map). Second, the number of parameters estimated in fitting the model with 11 SNPs is moderate, compared to the sample size n = 110. With 11 SNPs there are 11 × 10/2 = 55 population covariances estimated by sample statistics, along with 11 allele frequencies. There are at most 5 canonical correlations to include in the model, determined by the size of the smaller block. Once the number of canonical correlations to retain has been decided (by statistical inference), the other components of the Rao solution are uniquely determined. Thus in the present case there are at most 55 + 11 + 5 = 71 quantities to be estimated in fitting the model.

Figure 2.—

Figure 2.—

Physical map of the X-linked SNPs used in the SNP Consortium sample. Reference SNP (“rs”) identifiers were obtained by searching the original SNP identifier (a numeric code preceded by “TSC”) at the dbSNP web page, http://www.ncbi.nlm.nih.gov/projects/SNP/. Map positions are given in the files TSCmap.p1.Celera.txt.gz and TSCmap.p2.Celera.txt.gz, obtained from http://snp.cshl.org/. The centromere location (∼60 Mb) was obtained from the UCSC Genome Browser (http://genome.ucsc.edu/index.html; May 2004 assembly). Of the 39 X-linked SNPs typed by Celera, 20 SNPs for which multiple individuals had missing genotypes or obvious typing errors (e.g., heterozygous males) were eliminated. One additional SNP having a minor allele frequency of ∼0.05 was eliminated as being relatively uninformative for inferences about LD. Of the remaining 18 SNPs, 8 were on the left arm, and 10 were on the right arm of the X chromosome. The 5 most distal SNPs of the left arm and the 6 most distal SNPs of the right arm form the two blocks and are shown in the figure.

Inference of the number of source populations:

In the case of B = 2 blocks, fitting the Rao solution (29) turns the problem of inferring the number of independent sources into that of inferring the number of nonzero canonical correlations of the population LD matrix. Although test-statistics for making this inference exist, their null distributions are known only when the data are multivariate normal, or more generally are from an elliptically-contoured family (Muirhead and Waternaux 1980). I use a bootstrap approach in order to avoid making assumptions about the data distribution. It will be convenient to define some basic sample quantities: for i = 1,…n, let Inline graphic, Inline graphic be the x and y blocks of the ith gamete, and let Inline graphic, Inline graphic be the centered data matrices. The five sample canonical correlations for the TSC data are 0.46, 0.35, 0.25, 0.16, and 0.08.

The initial inference to be made is whether or not there is any support for admixture. For the pulse-decay model, this is equivalent to testing the null hypothesis that there is only one source population, hence no contribution from the term ΛΛ′ in (14). Under this hypothesis Σ itself is block-diagonal. A likelihood-ratio test for the null hypothesis of blockwise independence (block-diagonal covariance structure), assuming multivariate normal data, is well known (see, e.g., Anderson 1984, Chap. 9; Seber 1984, Sect. 3.5). For B = 2 blocks, the test-statistic is

graphic file with name M105.gif (30)

where r1 ≥ ··· ≥ rd are the sample canonical correlations (Anderson 1984, Sect. 12.4.1). For multivariate normal data, the sample value t0 is compared to the χ2 distribution on LxLy degrees of freedom.

The null hypothesis of blockwise independence suggests a natural permutation strategy that I use to obtain a null distribution for T0: if the sample consists of pairs (x1, y1) , … , (xn, yn), then a permuted sample under blockwise independence is (x1, yπ(1)), … , (xn, yπ(n)), where π is a uniform random permutation of {1, … , n}. I computed the test-statistic T0 for each of R = 9, 999 permuted samples, and obtained an approximate p-value of

graphic file with name M106.gif

for the test of blockwise independence, where tInline graphic, … , tInline graphic are the values of T0 for the permuted samples (see Davison and Hinkley 1997, Sect. 4.3). There is then good evidence against simple block-diagonal structure in the TSC data.

The permutation test does not extend readily to sequential tests which add nonzero canonical correlations to the model one at a time. So I proceed by computing sequential confidence regions, for the first canonical correlation, then for the first and second canonical correlations jointly, etc., by nonparametric bootstrapping. I take inclusion of the value zero in a confidence region as evidence that the corresponding correlation is nonsignificant.

Construction of a confidence interval for ρ1 is the first step. I use a bootstrap approach described by Davison and Hinkley (1997, section 2.7.2) which employs smooth transformations of sample statistics and corresponding variance estimates. An approximate pivotal quantity can be obtained from the first sample canonical correlation r1 using Fisher's (1921) transformation

graphic file with name M109.gif (31)

along with a variance approximation calculated from empirical influence values. The variance approximation is

graphic file with name M110.gif

where lj is the empirical influence value for the jth observation on r1 (see Davison and Hinkley 1997, Equation 2.38). Gu and Fung (1998) show that lj can be written as

graphic file with name M111.gif (32)

where sj1 and tj1 are the jth elements of Inline graphic and Inline graphic, respectively, and Inline graphic, Inline graphic are the sample canonical vectors associated with r1. The basic quantity to be bootstrapped is then Inline graphic. The function boot.ci, available in the boot library of the statistical software R (R Development Core Team 2005, version 2.2.1), handles the confidence interval calculations readily. A 95% studentized confidence interval for ρ1 for the TSC data, obtained from R = 9, 999 ordinary nonparametric bootstraps, is (0.18, 0.47), confirming the existence of a first canonical correlation clearly distinct from zero. As the sample canonical correlation is r1 = 0.46, r1 is apparently highly biased as a naïve estimator of ρ1. This can be confirmed by performing basic bootstraps of r1.

Bootstrapping the first two or more canonical correlations together requires additional care: the nonindependence of the canonical correlations (imposed by their rank ordering) calls for joint bootstrapping and the use of a covariance matrix reflecting the joint structure. For a joint bootstrap of the first two canonical correlations, each transformed according to Fisher's function, the covariance matrix estimate based on empirical influence values is d𝒱d, where Inline graphic,

graphic file with name M118.gif (33)
graphic file with name M119.gif

Inline graphic, Inline graphic, and Inline graphic, Inline graphic are the sample canonical vectors associated with r2 (see Hampel et al. 1986, Equation 4.2.2; Davison and Hinkley 1997, Sect. 5.8). The basic quantity to be bootstrapped is then the quadratic form

graphic file with name M124.gif (34)

where

graphic file with name M125.gif

The quantile Inline graphic is the key quantity recovered from the bootstrap samples; from this, a 100(1 – α)% joint confidence region for (ρ1, ρ2) is obtained as

graphic file with name M127.gif (35)

(see Davison and Hinkley 1997, Sect. 5.8). I checked the stability of the required matrix inversion by a preliminary bootstrap of the singular values of d′𝒱d. In the present case, d′𝒱d was invertible in all R = 9, 999 bootstrap samples.

Figure 3 shows joint confidence regions for (ρ1, ρ2) based on R = 9, 999 bootstrap samples. The joint confidence regions clearly support the inference ρ1 > 0, and give weaker, though measurable, support for ρ2 > 0. Before settling on a model with two canonical correlations, support for ρ3 > 0 needs to be investigated. I used the natural extensions of (33)–(35) to three variables, to compute joint confidence regions for (ρ1, ρ2, ρ3) based on R = 9, 999 bootstrap samples. With three parameters the confidence regions are nested three-dimensional surfaces, similar to the Russian dolls known as “matreshki.” The nested surfaces are not easy to display graphically, so in Figure 4 I show the intersections of the surfaces with the two-dimensional (ρ1, ρ2) plane, in which ρ3 = 0. Each of the three confidence regions intersects the plane, so I conclude that there is little support for a third canonical correlation. Thus the inference is that there are two significant correlations in the TSC data, indicating the presence of three independent sources of variation.

Figure 3.—

Figure 3.—

Joint confidence regions for (ρ1, ρ2) for the TSC data, obtained by nonparametric bootstrapping. The 45° line and horizontal line through zero mark the region on which (ρ1, ρ2) are defined. Confidence regions are shown as contours of the quadratic form q, given by expressions (34) and (35).

Figure 4.—

Figure 4.—

Intersections of the joint confidence regions for (ρ1, ρ2, ρ3) with the plane ρ3 = 0. The contours enclose portions of the plane ρ3 = 0 contained in the three-dimensional confidence regions; thus, for a given confidence level, the area within the contour indicates how much support there is for the inference ρ3 = 0.

Other methods for the TSC data:

The computer program Structure, available at http://pritch.bsd.uchicago.edu/structure.html, implements Bayesian methods for fitting admixture models described in Pritchard et al. (2000) and Falush et al. (2003). Pritchard et al. (2000) describe a method for inferring the number of source populations K, based on estimates of the probability of the sample given different values of K. I used Structure, along with rules of thumb described in the accompanying user's manual, to infer the number of source populations for the TSC data, for comparison with the inference made above. For each of three models of ancestry implemented in Structure (the “no admixture”, “admixture” and “linkage” models), I carried out five independent runs of the Structure program at each value of K = 1, … , 5 (for a total of 75 runs). For each run, a “burn-in” of 100,000 iterations was followed by 1,000,000 Markov-chain Monte Carlo updates. Arguably, none of the ancestry models implemented in Structure are appropriate for the TSC SNPs, some of which are very tightly linked. Each ancestry model gave strongest support to the value K = 1 for the TSC data. For models with K > 1, in all runs, the overall sample proportion assigned to a “cluster” was 1/K for each cluster. These results are understood to indicate an absence of population structure.

The programs in Eigensoft (http://genepath.med.harvard.edu/∼reich/EIGENSTRAT.htm) perform a principal components analysis described by Patterson et al. (2006) for inference of admixture. Under this method, K – 1 is inferred as the number of statistically significant eigenvalues of an interindividual covariance matrix [having a form similar to expression (4)]. As for Structure, I have reservations about using Eigensoft on the TSC sample, because the data architecture is of a form not intended by Patterson et al. (2006). In particular, Patterson et al. intend (but do not require) that the number of markers L be larger than the number of individuals n, and acknowledge that their method is not designed specifically to accomodate markers in linkage disequilibrium. I applied the Eigensoft genotype coding scheme to the TSC X chromosomes as follows: a male with the “reference” allele (chosen arbitrarily) at an X chromosome locus has the value 0 at the locus, whereas a male with the “variant” allele has the value 2. I used the regression method described in Patterson et al. (2006) to correct for LD, and varied the number of adjacent markers used for adjustment between 1, 2 and 3. In each case Eigensoft called all eigenvalues nonsignificant, thus the inference would be that only one population contributed to the TSC sample.

The admixture decomposition:

The Rao solution can be computed by straightforward matrix operations commonly implemented in statistical or numerical packages. I do the calculations in R (R Development Core Team 2005, version 2.2.1), and describe a numerically stable method for computing the decomposition in the appendix.

The Rao decomposition for the TSC databased on two canonical correlations is shown graphically in Figure 5. The estimates Λ*Λ*′ and Ψ* satisfy the constraints 𝒞 imposed by the sample allele frequencies. Covariances due to admixture, involving rs1541354 and rs1859004 (respectively, the third and fourth SNPs from the left in Figure 5) with several of the other SNPs, can be seen as vertical bands in the third and fourth columns of Λ*Λ*′. A comparison of the off-diagonal block of the residual matrix with the corresponding block of S reveals a notable tendency for residual covariances to be corrected toward zero.

Figure 5.—

Figure 5.—

Graphical representation of the covariance structure model fitted to the TSC data. The shaded squares depict absolute values of matrix elements below the main diagonal, for the sample covariance matrix S, the matrix of estimated admixture covariances Λ*Λ*′ (for the model with two canonical correlations), and the residual matrix SΛ*Λ*′. The key, top right, gives the shading scheme applied to absolute values of matrix elements. The off-diagonal block containing covariances between SNPs of different blocks is marked by thick lines. The estimate Ψ* is obtained by setting the elements of the off-diagonal block of the residual matrix equal to zero.

Data from the International HapMap Consortium:

In broad terms, the second example repeats the analysis of the TSC data, using a different sample of X chromosomes typed by the International HapMap Consortium (2005), HapMap Public Release #21, http://www.hapmap.org. Here I want to demonstrate that a different sample having similar structure yields similar results. The HapMap sample consists of 53 males of the Yoruban parent-offspring trios from Ibadan, Nigeria; 23 Japanese males from Tokyo; 22 Han Chinese males from Beijing; and 44 males of the Utah CEPH parent-offspring trios. Some of the Utah CEPH pedigrees typed by Celera Genomics for TSC were also typed by HapMap; thus 20 of the Utah males I include in the HapMap sample were also in the TSC sample.

Although HapMap typed many more SNPs overall than TSC, the particular SNPs I used for the TSC sample were incompletely typed by HapMap. I elected to choose a new set of SNPs for the HapMap sample, located in the same X-chromosome regions represented by blocks 1 and 2 of the TSC sample. The HapMap browser locates the TSC SNPs of block 1 in the X-chromosome region 9.2 - 13.7Mb, and the SNPs of block 2 in 139.3 - 147.3Mb. I judged that a model with 15 SNPs would preserve a reasonable balance between sample size (n = 142) and number of parameters. I chose 15 SNPs at random from those typed by HapMap, allocated to blocks 1 and 2 in rough proportion to the sizes of the two regions. Prior to choosing the 15 SNPs at random, I excluded SNPs from the two regions which had obvious genotyping errors or which failed to type in one or more individuals. I discarded approximately three random SNP sets because the sample covariance matrices they produced were singular, or because the Rao decomposition based on 1, 2 or 3 canonical correlations would have failed to satisfy the constraints imposed by the sample allele frequencies. The latter biased the selection toward a SNP set which yields a good fit of the model, but did not privilege any particular value of K. Figure 6 shows the physical map of the 15 chosen SNPs.

Figure 6.—

Figure 6.—

Physical map of the X-linked SNPs used in the HapMap sample. Details of the SNP selection procedure are given in the text. The flat files containing raw data and SNP positions are available at http://www.hapmap.org/.

I used the same data analysis strategy for the HapMap sample as for the TSC sample, so I will summarize results briefly. The six sample canonical correlations for the HapMap data are 0.50, 0.41, 0.33, 0.23, 0.20 and 0.09. The permutation test of the statistic T0 produced an approximate p-value of 0.0007, indicating the presence of at least one significant canonical correlation. A 95% confidence interval for the first correlation is (0.25, 0.49), and joint confidence regions for the first two correlations are shown in Figure 7. The HapMap plot (not shown) analogous to Figure 4 has no contours in the (ρ1, ρ2) plane: thus ρ3 = 0 is excluded from each of the 95, 90, and 85% confidence regions. I take this to imply support for a third correlation. Graphical display of 4-dimensional regions is highly problematic, so I checked for the presence of a fourth significant correlation numerically. I bootstrapped the four-dimensional analog of (34) in order to estimate quantiles q*, then evaluated (34) at candidate points (ρ1, ρ2, ρ3, ρ4 = 0), with ρ1, ρ2 and ρ3 varying in a region of high support. I found many such parameter combinations inside the joint confidence regions, and concluded that there is little evidence in the HapMap sample for a fourth correlation. Thus the inference is that there are three significant correlations in the HapMap data, indicating the presence of four independent sources of variation. The Rao decomposition for the HapMap data, based on three canonical correlations, is shown graphically in Figure 8. The decomposition satisfies the constraints 𝒞.

Figure 7.—

Figure 7.—

Joint confidence regions for (ρ1, ρ2) for the HapMap data, analogous to Figure 3. The HapMap plot analogous to Figure 4 has no contours in the ρ3 = 0 plane.

Figure 8.—

Figure 8.—

Graphical representation of the covariance structure model fitted to the HapMap data, analogous to Figure 5. The decomposition uses three canonical correlations.

Other methods for the HapMap data:

I followed the same strategy using Structure to infer K for the HapMap data as for the TSC data. In contrast to the TSC data, Structure finds evidence for multiple clusters in the HapMap data; in particular, the “linkage” model produces evidence for four clusters. In five runs of the linkage model with K = 4, Structure consistently assigned overall sample proportions near 0.27 to three clusters, and 0.2 to a fourth cluster. Posterior probabilities of cluster membership suggest that most individuals are of mixed ancestry, having roughly equal proportions of their alleles assigned to two or more clusters.

I followed the same strategy using Eigensoft for the HapMap data as for the TSC data. Eigensoft calls one eigenvalue significant when either 1 or 2 adjacent markers are used for LD adjustment, and calls no eigenvalues significant when 3 adjacent markers are used. Thus Eigensoft suggests that at most two sub-populations have contributed to the HapMap X chromosome data.

THE MODEL WITH B ≥ 3 BLOCKS

Under mild conditions, the model with B ≥ 3 is identified up to post-multiplication of Λ by an orthogonal matrix (i.e., up to rotations of Λ). However, explicit formulae for parameter estimates are not readily available.

To fix notation, the block-partitioned parameter matrices are Ψ, of the form (13), and

graphic file with name M128.gif

where Λb is Lb × (K – 1); b = 1 , … , B. For B ≥ 3, Browne (1980, Proposition 2) shows that a sufficient condition for identification of Ψ, and of Λ up to rotations, is that at least three of the sub-matrices Λb of Λ be of full column rank.

An assumption of the pulse-decay model, built on the notion that Inline graphic, is that the source-population vectors p(1), … p(K) are linearly independent. If Lb, the number of markers in the bth block, is sufficiently large, it is reasonable to assume that linear independence also holds for the block-specific sub-vectors pb(1) , … , pb(K). In fact, provided LbK – 1, a block-specific version of Proposition 2 (of this paper) is readily established, using suitably defined matrices Γb, Inline graphicbb, and a proof of the same form. Browne's Proposition consequently implies that Ψ is identified, and Λ is identified up to rotations, if LbK – 1 for at least three blocks b ∈ {1, … B}. Furthermore, when at least three blocks meet this minimum size criterion, the matrix Inline graphic = ΛΛ′ is unique: for Inline graphic = ΛΛ′ = ΛQ (ΛQ)′ when Q is orthogonal.

Obtaining an analog of Proposition 3 (of this article) for B ≥ 3 appears to be an open problem. Using

graphic file with name M133.gif

Ψ = B diag[SΛΛ′] and the reasoning of Proposition 3, GLS estimates ΛInline graphic , … , ΛInline graphic for the unconstrained problem would be obtained by minimizing

graphic file with name M136.gif (36)

It is here that the strategy used for the model with B = 2 blocks breaks down, for although the Eckart–Young theorem (Gower and Hand 1996, Sect. A.4.4) can be used to minimize the individual summands of (36), it is not clear how to reconcile the different expressions for ΛInline graphic obtained as i varies.

At present, numerical estimation of (Λ, Ψ) appears to be the only feasible option for the model with B ≥ 3. Lee (1980) gives algorithms for GLS estimation under parameter constraints, which are appropriate for the data under consideration. For inference of K, an ad hoc strategy involving sequential goodness-of-fit tests, similar to a method used in ordinary factor analysis (see, e.g., Bartholomew and Knott 1999, Sect. 3.8–3.9), may be pursued. For a candidate value K and a GLS estimate Λ*, where Λ* has K – 1 columns, the fit of the model may be assessed by testing the residual matrix SΛ*Λ*′ for block-diagonal structure (see e.g., Anderson 1984, Chap. 9; Seber 1984, Sect. 3.5); this assumes that such tests could be suitably adapted to multivariate binary observations. The smallest candidate value Inline graphic providing the required block-diagonal residual matrix may be taken as an estimate of K.

DISCUSSION

Here I have shown that a multilocus model of pairwise LD in an admixed population implies a particular covariance structure model (Equation 14), and that the model can be fitted feasibly to data. The simple pulse-decay framework will be implausible for many samples; nonetheless it may be useful to fit the statistical model, even when the form of admixture is clearly more complex, in order to discover or describe covariance structure. In this case, the “existence” result which connects the admixture model to (14) cannot be invoked: instead, it should be acknowledged that a convenient model of quasi-independent sources of covariance is being imposed on the data. The model (14), along with the statistical methods for fitting it, are of intrinsic interest and will likely have further applications in genomics. Regrettably, this description is too long to include work on a more general admixture model, in which source populations contribute to an admixed population at each generation (e.g., the “gradual” admixture model of Ewens and Spielman 1995). Preliminary calculations (not shown here) suggest strongly that a decomposition resembling Equation 9, involving a block-diagonal matrix and several non-block-diagonal matrices, will hold for the more general model. Questions about the ranks of these matrices and about statistical inference in the general model remain for future work.

I can suggest two immediate, practical applications of this work. First, the model provides a well-supported statistical method for inferring the number of source populations contributing to an admixed population, for samples in which SNP markers can be organized into two blocks. In multiblock samples, the same method may still be applied by focusing attention on two well-chosen blocks. A complete extension of the method to multiblock samples is a desirable objective for future work. For the TSC and HapMap samples, I used canonical correlations to infer the presence of K = 3 and K = 4 independent sources of covariance, respectively. However, the inferred sources need not correspond to the three (respectively four) groups defined in these samples by ethnic labels. Under the model, SNPs, not labeled individuals, come from independent sources.

A second application is the detection, through examination of Inline graphic* = Λ*Λ*′, of SNPs worthy of scrutiny in an association mapping study, biogeography survey, or other context in which genetic mixtures are of interest. In both data examples considered here, examination of Λ*Λ*′ revealed SNPs that had relatively large admixture LD in pairwise combinations with other markers. Such SNPs would be more likely to show “spurious” associations with phenotypic traits, in association-mapping studies not designed to detect admixture effects. The elements of Ψ* are components of LD remaining after admixture artifacts have been removed; thus Ψ* contains observed LD values adjusted for admixture effects. These quantities may provide suitable surrogates for physical linkage in admixed populations, similar to conventional LD measures in unstructured populations. We can imagine an association-mapping scenario in which a binary trait has been roughly mapped to a linkage group (say, represented by block 1), and it is desired to fine-map the trait in a potentially admixed population. We include the trait as a variable in block 1, and fit the covariance structure model to block 1 along with a second unlinked block, to detect associations between the trait and markers due to admixture alone. The elements of Ψ* then give corrected components of LD between the trait and markers of block 1 and suggest which markers lay in close proximity to the trait locus.

I do not use the model to make inferences at the level of the individual, concerning, e.g., the proportion of alleles on a gamete inherited from a putative source population. The fine-grained information provided by the model consists of the matrix elements of the decomposition. Progress toward individual inference may be possible by noting that the canonical variable pairs (sk, tk); k = 1, … d, used here to calculate empirical influence values, are also used in classification and assignment (see Krzanowski and Marriott 1995, Sect. 4.11). Formal inference at the individual level apparently requires estimation of individual factor scores f in Equation 15. These factor scores are indeterminate from the view of multivariate analysis (see, e.g., Lawley and Maxwell 1971, Chap. 8; Mardia et al. 1979, Sect. 9.7): a single sample cannot be used to simultaneously estimate Λ, Ψ, and the individual factor scores f1 , … , fn.

A reviewer asked if the model and statistical methods still apply when phase-unknown genotypes, rather than gametes, are observed. Only the method for inference of the number of significant canonical correlations needs to be modified for genotype data. Pairwise composite LD estimates, which require only unphased genotypes (Cockerham and Weir 1977; Weir 1979), can be organized in matrix form to estimate the population covariance matrix. The covariance structure model (14) and the generalized least-squares method depend only on the covariance matrix and sample allele frequencies. Fitting the Rao solution (29) for B = 2 blocks requires a decision about the number of canonical correlations to include; although here I bootstrapped gametes to make the decision, other avenues are available. There is no impediment to bootstrapping unphased genotypes, although to perform calculations analogous to these, empirical influence values for genotypes will need to be derived, perhaps using techniques of Romanazzi (1992) or Gu and Fung (1998). Genotype-based influence values could then be used similarly to (32) and (33) in variance approximations. Dümbgen (1998) gives an elegant method for constructing simultaneous confidence intervals for canonical correlations. The numerical implementation of Dümbgen's method requires only bootstrap realizations of the sample covariance matrix, which could be formed by composite LD calculations on bootstrapped genotypes. A key quantity, β, used in Dümbgen's method is of order (L/n)1/2, and a small value of β is desired to produce reasonably narrow intervals. After experimenting with Dümbgen's method, I concluded that samples larger than the TSC or HapMap samples would be needed for successful implementation.

In fitting the model to data, I have stayed within relatively strict limits on the number of parameters estimated, vis à vis the sample size, to avoid straining the statistical theory unduly. I view each sample covariance as an estimate of a population covariance parameter: thus the number of parameters of the model is a quadratic function of L, the number of SNPs. Admittedly, using the sample size as an upper bound on the number of parameters of the model seems severe, at a time when thousands of SNPs can be typed simultaneously in an individual. However, allowing the sample size and number of SNPs to be in the reverse relationship, n < L, calls upon multivariate theory still in early development.

Recently, Patterson et al. (2006) have also taken a multivariate approach to the analysis of population structure. A thorough comparison of the method I have described with that of Patterson et al. will require significant work, although a few points of comparison can be discerned. Patterson et al. note that blocks showing significant “standing” LD pose problems for their method, although they give an ad hoc strategy for handling this situation. The method I have described makes explicit use of such blocks, via the block-diagonal structure of Ψ. A key difference between the two approaches is that Patterson et al. decompose an n × n matrix of inter-individual covariances, whereas I decompose an L × L matrix of covariances between SNPs. Under circumstances described by Gower (1966) and Jolliffe (2002, Sect. 5.2), there is a duality relationship between subject-space and variable-space decompositions. If a model in variable space dual to the model of Patterson et al. can be described, it will be of interest to study the structure of the matrices analogous to Λ and Ψ. Notably, Patterson et al. have developed their approach with the relationship n < L specifically in mind. Although they use multivariate theory for n < L to the extent it is available, they necessarily rely on (well-argued) conjecture and numerical demonstration where theoretical results are lacking.

In my view, the relationship between n and L is connected to basic notions of degrees of freedom. The need for theoretical and analytic tools suitable for genomic data may prompt us to rethink how much can be inferred from a sample of fixed size. However, it seems prudent to admit from the start that the amount of information provided by a sample is limited and that the scope of inference should be limited accordingly.

I obtained the flat files TSCmap.p1.Celera.txt.gz and TSCmap.p2.Celera.txt.gz, from which the TSC data were extracted, at http://snp.cshl.org/. I obtained the flat files genotypes_chrX_CEU_r21_nr_fwd.txt.gz, genotypes_chrX_JPT+CHB_r21_nr_fwd.txt.gz, and genotypes_chrX_YRI_r21_nr_fwd.txt.gz, from which the HapMap build 35 data were extracted, at http://www.hapmap.org/. Python scripts used to perform data extraction and R scripts used to perform data analysis are available from the author.

Acknowledgments

Many thanks go to Chuck Langley; many of these ideas originated in discussions with him. Terry Speed and Matthew Stephens read early drafts of the manuscript, giving valuable comments and suggestions. Michael Browne, Lutz Dümbgen, Michael Perlman, and Yutaka Tanaka answered key questions about multivariate analysis. The comments of three anonymous reviewers on previous versions led to significant revisions. This work was supported by National Human Genome Research Institute (National Institutes of Health) grants 5R01HG002107-03 and 7R01MH60007.

APPENDIX

Proof of Proposition 1:

Note that the Schur product theorem (see, e.g., Horn and Johnson 1985, Sect. 7.5) cannot be used in the present case, as Rt0 is not assumed.

Because ΓΓ′ ≥ 0 by definition, for any yInline graphicL,

graphic file with name M141.gif (37)

Hence Inline graphic0. Let Db(k) be the covariance matrix for the loci of block b in the kth source population, and let Rt,b be the bth diagonal block of Rt, for b = 1 , … , B. As Db(k) ≥ 0 and q(k) > 0, Inline graphic as well. The nonnegative definiteness of Inline graphic is then established by considering yΨby in a manner analogous to (37), for Inline graphic. Finally, because each of the diagonal blocks Ψ1 , … , ΨB of Ψ is nonnegative definite, Ψ0 as well (see, e.g., Harville 1997, Lemma 14.8.3).

Proof of Proposition 2:

Inline graphic is symmetric and nonnegative definite, so only its rank is in question. The rank of Inline graphic is obtained by showing that Inline graphic and ΓΓ′, the latter of which has an easily determined rank, have the same null-space 𝒩.

For yInline graphicL, the lth element of the column vector Inline graphic satisfies the inequalities

graphic file with name M151.gif (38)

If y ∈ 𝒩(ΓΓ′), then

graphic file with name M152.gif

and (38) implies (Inline graphic)l = 0; l = 1, … L. Hence y ∈ 𝒩Inline graphic. By similar reasoning, if y ∈ 𝒩Inline graphic, then y ∈ 𝒩(ΓΓ′). 𝒩(Γ) consists of scalar multiples of the vector

graphic file with name M156.gif

hence rankInline graphic = rank(ΓΓ′) = rank(Γ) = K – dim[𝒩(Γ)] = K – 1.

Having established that Inline graphic is symmetric, nonnegative definite with rank K – 1, the existence of the decomposition Inline graphic = ΛΛ′ is a standard result (see, e.g., Harville 1997, Theorem 14.3.7).

Proof of Proposition 3:

The necessity of (25) for block-diagonal V implies that minimization of fV, for the unconstrained problem, need only be carried out on

graphic file with name M160.gif

Direct substitution of Ψ = B diag[SΛΛ′] in fV, along with the particular choice of V, yields a “concentrated” discrepancy function f* depending only on Λ:

graphic file with name M161.gif (39)

The GLS problem is now reduced to one of finding an Lx × Ly matrix Inline graphic of rank K – 1, which minimizes (39). The required matrix is obtained by rescaling the SVD of Inline graphic and retaining the components corresponding to the K – 1 largest singular values. Rao (1979) gives

graphic file with name M164.gif

which attains the global minimum

graphic file with name M165.gif

In current practice, the result is understood as an application of the Eckart–Young theorem in the metrics Sxx and Syy (see, e.g., Gower and Hand 1996, Sect. A.4.4). The factorization Inline graphic, with ΛInline graphic and ΛInline graphic as in (29), is somewhat arbitrary, but produces Λ* of the appropriate rank.

A modification of Rao's result (1973, Sect. 8f.3) for the statistical setting establishes that Ψ* ≥ 0, as follows. Because U is orthogonal, Sxx = SInline graphicUUSInline graphic, and hence

graphic file with name M171.gif

where Inline graphic are the columns of U and 1 ≥ r1 ≥ ··· ≥ rK–1 ≥ 0. Because Z ≥ 0, the product Inline graphic ≥ 0 as well (see, e.g., Harville 1997, Theorem 14.2.9). Nonnegative definiteness of ΨInline graphic is established using a similar argument, and together with ΨInline graphic ≥ 0 this implies Ψ* ≥ 0 (see, e.g., Harville 1997, Lemma 14.8.3).

Numerical calculation of the Rao decomposition:

The following calculations are adapted from Seber (1984, Sect. 5.7, 10.1.5). The QR decompositions of the centered data matrices are Inline graphic, Inline graphic, where Inline graphic, Inline graphic have orthonormal columns, and Inline graphic, Inline graphic are upper triangular with positive diagonal elements. The sample canonical correlations r1 ≥ ··· ≥ rd occur as the diagonal elements of Δ in the singular-value decomposition

graphic file with name M182.gif (40)

where Inline graphic, Inline graphic are orthogonal and Inline graphic is as in Equation 28 (Seber 1984, Sect. 10.1.5). Equation 40, along with some matrix algebra, yields the decomposition

graphic file with name M186.gif

which is equivalent to the decomposition of Sxy implied by (28). Parameter estimates equivalent to those of the Rao solution (29) are then given by

graphic file with name M187.gif (41)

References

  1. Anderson, T. W., 1984. An Introduction to Multivariate Statistical Analysis, Ed. 2. Wiley & Sons, New York.
  2. Bartholomew, D. J., and M. Knott, 1999. Latent Variable Models and Factor Analysis. Arnold, London.
  3. Browne, M. W., 1974. Generalized least-squares estimators in the analysis of covariance structures. South African Stat. J. 8: 1–24. [Google Scholar]
  4. Browne, M. W., 1979. The maximum-likelihood solution in inter-battery factor analysis. Br. J. Math. Statist. Psychol. 32: 75–86. [Google Scholar]
  5. Browne, M. W., 1980. Factor analysis of multiple batteries by maximum likelihood. Br. J. Math. Statist. Psychol. 33: 184–199. [Google Scholar]
  6. Browne, M. W., 1984. Asymptotically distribution-free methods for the analysis of covariance structures. Br. J. Math. Statist. Psychol. 37: 62–83. [DOI] [PubMed] [Google Scholar]
  7. Browne, M. W., and K. Tateneni, 1997. Noniterative estimation for the multiple battery factor analysis model. Behaviormetrika 24: 3–18. [Google Scholar]
  8. Cavalli-Sforza, L. L., and W. F. Bodmer, 1971. The Genetics of Human Populations. W. H. Freeman, San Francisco.
  9. Clark, A. G., R. Nielsen, J. Signorovitch, T. C. Matise, S. Glanowski et al., 2003. Linkage disequilibrium and inference of ancestral recombination in 538 single-nucleotide polymorphism clusters across the human genome. Am. J. Hum. Genet. 73: 285–300. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Cockerham, C. C., and B. S. Weir, 1977. Digenic descent measures for finite populations. Genet. Res. 30: 121–147. [Google Scholar]
  11. Crow, J. F., and M. Kimura, 1970. An Introduction to Population Genetics Theory. Burgess Publishing, Minneapolis.
  12. Davison, A. C., and D. V. Hinkley, 1997. Bootstrap Methods and Their Application. Cambridge University Press, Cambridge, UK.
  13. Dümbgen, L., 1998. Perturbation inequalities and confidence sets for functions of a scatter matrix. J. Multivariate Anal. 65: 19–35. [Google Scholar]
  14. Ekholm, A., P. W. F. Smith and J. W. McDonald, 1995. Marginal regression analysis of a multivariate binary response. Biometrika 82: 847–854. [Google Scholar]
  15. Ethier, S. N., 1979. A limit theorem for two-locus diffusion models in population genetics. J. Appl. Prob. 16: 402–408. [Google Scholar]
  16. Ewens, W. J., and R. S. Spielman, 1995. The transmission/disequilibrium test: history, subdivision, and admixture. Am. J. Hum. Genet. 57: 455–464. [PMC free article] [PubMed] [Google Scholar]
  17. Falush, D., M. Stephens and J. K. Pritchard, 2003. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567–1587. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Fisher, R. A., 1921. On the probable error of a coefficient of correlation deduced from a small sample. Metron 1: 3–32. [Google Scholar]
  19. Gabriel, S. B., S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy et al., 2002. The structure of haplotype blocks in the human genome. Science 296: 2225–2229. [DOI] [PubMed] [Google Scholar]
  20. Gower, J. C., 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53(3–4): 325–338. [Google Scholar]
  21. Gower, J. C., and D. J. Hand, 1996. Biplots. Chapman & Hall, London.
  22. Gu, H., and W. K. Fung, 1998. Assessing local influence in canonical correlation analysis. Ann. Inst. Statist. Math. 50(4): 755–772. [Google Scholar]
  23. Hampel, F. R., E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel, 1986. Robust Statistics: The Approach Based on Influence Functions. Wiley & Sons, New York.
  24. Harville, D. A., 1997. Matrix Algebra From a Statistician's Perspective. Springer-Verlag, New York.
  25. Horn, R. A., and C. R. Johnson, 1985. Matrix Analysis. Cambridge University Press, Cambridge, UK.
  26. International HapMap Consortium, 2005. A haplotype map of the human genome. Nature 437(27): 1299–1320. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Jolliffe, I. T., 2002. Principal Component Analysis. Springer-Verlag, New York.
  28. Jöreskog, K. G., 1981. Analysis of covariance structures. Scand. J. Statist. 8: 65–92. [Google Scholar]
  29. Karlin, S., and J. McGregor, 1968. Rates and probabilities of fixation for two locus random mating finite populations without selection. Genetics 58: 141–159. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Krzanowski, W. J., and F. H. C. Marriott, 1995. Multivariate Analysis Part 2: Classification, Covariance Structures and Repeated Measurements. Arnold, London.
  31. Lawley, D. N., and A. E. Maxwell, 1971. Factor Analysis as a Statistical Method, Ed. 2. Butterworth, London.
  32. Lazarsfeld, P. F., and N. W. Henry, 1968. Latent Structure Analysis. Houghton Mifflin, Boston.
  33. Lee, S. Y., 1980. Estimation of covariance structure models with parameters subject to functional restraints. Psychometrika 45: 309–324. [Google Scholar]
  34. Lewontin, R. C., 1964. The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49: 49–67. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Lewontin, R. C., and K. Kojima, 1960. The evolutionary dynamics of complex polymorphisms. Evolution 14: 458–472. [Google Scholar]
  36. Magnus, J. R., and H. Neudecker, 1988. Matrix Differential Calculus With Applications in Statistics and Econometrics. Wiley & Sons, New York.
  37. Mardia, K. V., J. T. Kent and J. M. Bibby, 1979. Multivariate Analysis. Academic Press, London.
  38. Matise, T. C., R. Sachidanandam, A. G. Clark, L. Kruglyak, E. Wijsman et al., 2003. A 3.9-centimorgan-resolution human single-nucleotide polymorphism linkage map and screening set. Am. J. Hum. Genet. 72: 271–284. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. McDonald, R. P., and K. S. Ahlawat, 1974. Difficulty factors in binary data. Br. J. Math. Statist. Psychol. 27: 82–99. [Google Scholar]
  40. Muirhead, R. J., and C. M. Waternaux, 1980. Asymptotic distributions in canonical correlation analysis and other multivariate procedures for nonnormal populations. Biometrika 67: 31–43. [Google Scholar]
  41. Myers, S., L. Bottolo, C. Freeman, G. McVean and P. Donnelly, 2005. A fine-scale map of recombination rates and hotspots across the human genome. Science 310: 321–324. [DOI] [PubMed] [Google Scholar]
  42. Ohta, T., 1982. Linkage disequilibrium with the island model. Genetics 101: 139–155. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Ohta, T., and M. Kimura, 1969. Linkage disequilibrium at steady state determined by random genetic drift and recurrent mutation. Genetics 63: 229–238. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Patterson, N., A. L. Price and D. Reich, 2006. Population structure and eigenanalysis. PLoS Genetics 2: 2074–2093. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Phillips, M. S., R. Lawrence, R. Sachidanandam, A. P. Morris, D. J. Balding et al., 2003. Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat. Genet. 33: 382–387. [DOI] [PubMed] [Google Scholar]
  46. Pritchard, J. K., M. Stephens and P. Donnelly, 2000. Inference of population structure using multilocus genotype data. Genetics 155: 945–959. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Prout, T., 1973. Appendix, pp. 494–496 in Mitton, J.B., and R.K.Koehn, 1973, Population genetics of marine pelecypods. III. Epistasis between functionally related isoenzymes of Mytilus edulis. Genetics 73: 487–496. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. R Development Core Team, 2005. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. http://www.r-project.org.
  49. Rao, C. R., 1973. Linear Statistical Inference and Its Applications, Ed. 2. Wiley & Sons, New York.
  50. Rao, C. R., 1979. Separation theorems for singular values of matrices and their applications in multivariate analysis. J. Multivar. Anal. 9: 362–377. [Google Scholar]
  51. Robinson, W. P., M. A. Asmussen and G. Thomson, 1991. Three-locus systems impose additional constraints on pairwise disequilibria. Genetics 129: 925–930. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Romanazzi, M., 1992. Influence in canonical correlation analysis. Psychometrika 57: 237–259. [Google Scholar]
  53. Satten, G. A., W. D. Flanders and Q. Yang, 2001. Accounting for unmeasured population structure in case-control studies of genetic association using a novel latent-class model. Am. J. Hum. Genet. 68: 466–477. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Seber, G. A. F., 1984. Multivariate Observations. Wiley & Sons, New York.
  55. Thomson, G., and M. Baur, 1984. Third order linkage disequilibrium. Tissue Antigens 24: 250–255. [DOI] [PubMed] [Google Scholar]
  56. Tucker, L. R., 1958. An inter-battery method of factor analysis. Psychometrika 23: 111–136. [Google Scholar]
  57. Weir, B. S., 1979. Inferences about linkage disequilibrium. Biometrics 35: 235–254. [PubMed] [Google Scholar]
  58. Weir, B. S., 1996. Genetic Data Analysis. Sinauer Associates, Sunderland, MA.
  59. Weir, B. S., and C. C. Cockerham, 1974. Behavior of pairs of loci in finite monoecious populations. Theor. Pop. Biol. 6: 323–354. [DOI] [PubMed] [Google Scholar]
  60. Wickens, T. D., 1995. The Geometry of Multivariate Statistics. Erlbaum Associates, Hillsdale, NJ.
  61. Wright, S., 1940. Breeding structure of populations in relation to speciation. Am. Nat. 74: 232–248. [Google Scholar]

Articles from Genetics are provided here courtesy of Oxford University Press

RESOURCES