Identifying Individuals in a Complex Mixture of DNA with Unknown Ancestry

Joshua Sampson; Hongyu Zhao

doi:10.2202/1544-6115.1469

2009 Sep 9;8(1):37.

PMCID: PMC2861329 PMID: 19799556

Abstract

A new test was recently developed that could use a high-density set of single nucleotide polymorphisms (SNPs) to determine whether a specific individual contributed to a mixture of DNA. The test statistic compared the genotype for the individual to the allele frequencies in the mixture and to the allele frequencies in a reference group. This test requires the ancestries of the reference group to be nearly identical to those of the contributors to the mixture. Here, we first quantify the bias, the increase in type I and type II error, when the ancestries are not well matched. Then, we show that the test can also be biased if the number of subjects in the two groups differ or if the platforms used to measure SNP intensities differ. We then introduce a new test statistic and a test that only requires the ancestries of the reference group to be similar to the individual of interest, and show that this test is not only robust to the number of subjects and platform, but also has increased power of detection. The two tests are compared on both HapMap and simulated data.

1. Introduction

Given a mixture of DNA samples from numerous individuals, it is often desirable to determine whether a specific individual contributes DNA to that mixture. Using forensics as an example, this mixture can be a specimen from a crime scene, and the goal can be determining whether a suspect’s DNA is included in that specimen. Many methods have been proposed to identify the presence of an individual within a mixture. Most of them focus on cases where only a few people contribute to the mixture. These methods usually compare short tandem repeats (STR) in the mixture to those in the individual (Fung and Hu, 2002; Balding, 2003; Foreman et al., 2003). When only interested in males, the comparison can be limited to STRs on the Y chromosome (Jobling and Gill, 2004). In cases where the DNA has degraded, a second approach, comparing Mitochondrial DNA (mtDNA), specifically the hypervariable region, between mixture and individual can be a better alternative (Stoneking et al., 1991). The limitations of each method were discussed in Homer et al. (2008), but in brief, methods based on STRs require the DNA to be in good condition and methods based on mtDNA, even when augmented by informative SNPs, can have limited discriminatory power (Homer et al., 2008).

In a ground-breaking paper, Homer et al. (2008) proposed a new method that uses a genome-wide set of SNPs to identify an individual in a mixture. Their method compares ${\vec{Y}}_{M} = \{Y_{M 1}, \dots, Y_{M N}\}$, where Y_{M j} is the proportion or frequency of the “A” allele in the mixture at SNP j, with ${\vec{Y}}_{0} = \{Y_{0 1}, \dots, Y_{0 N}\}$, where Y_{0 j} is the proportion of the “A” allele in the individual. Clearly, the possible values of Y_{0 j} are 0, 0.5, and 1. Note that we have assumed there are N SNPs and have labeled the two possible alleles at each SNP as A and B. Not only is their new method based on a more robust technique, it also has higher resolution and can identify an individual in a mixture containing thousands of subjects. This method also highlighted the potential for a new problem, important enough to be discussed in Science (Couzin, 2008) and to require an immediate statement by the NIH. Their results showed that the practice of publicly releasing group-level results from case/control GWAS studies may jeopardize participants’ anonymity.

The test statistic, T, proposed by Homer et al., compares the similarity between ${\vec{Y}}_{0}$ and ${\vec{Y}}_{M}$ to the similarity between ${\vec{Y}}_{0}$ and ${\vec{Y}}_{R}$, where ${\vec{Y}}_{R}$ are the allele frequencies in a reference mixture. If T is large enough, the corresponding test will reject the null hypothesis that the individual of interest does not contribute DNA to the mixture. For this test to perform well, either the subjects for the reference group must be carefully chosen so that their ancestral composition matches that of the subjects contributing to the mixture (e.g. if Caucasians contribute to the mixture, then the reference group must also be Caucasian), or only a select subgroup of ancestry-independent SNPs can be used (Kidd et al., 2006). The need for similar reference groups is clear. Assume we fail to select a similar reference group, and the individual is more similar, in terms of ancestry, to those subjects contributing to the mixture. Then, even if the subject’s DNA is absent from the mixture, ${\vec{Y}}_{0}$ will be more similar to ${\vec{Y}}_{M}$, T will be large, and the test will likely result in a false positive. Similarly, if the individual and the reference group are more similar, the test can lead to a false negative. The obvious problem is that the identities, and therefore the ancestries, of the individuals in the mixture are unknown, and this required matching can be very difficult.

The first goal of this paper is to quantify the magnitude of the bias, type I error, and type II error that can occur if the ancestries of the two groups are poorly matched. Then we identify two other possible sources of bias, the type of platform (e.g. Illumina, Affymetrix) and the number of subjects in the reference group. We use HapMap data to further quantify the extent of the bias in real samples (The International HapMap Consortium, 2003). After demonstrating the severity of the potential problems caused by using T, we propose a new statistic that only requires selecting a reference group with an ancestry that matches the ancestry of the individual of interest. This is a far simpler task, as the individual’s ancestry is usually known. Other benefits of the new statistic are that it is unbiased and has a known null distribution. We then compare the performance of the two statistics using simulated data. The paper is ordered as follows. Section 2 starts by introducing notation and then continues with a discussion of the properties of both statistics and their associated tests. Section 3 demonstrates the performance of the statistics and their associated tests using both HapMap and simulated data. Finally, section 4 contains our brief concluding remarks.

2. Methods

2.1. Notation

Let the individual of interest be indexed by 0, and let the n_R subjects in the reference population and the n_M subjects in the mixture be indexed by {1, ..., n_R + n_M}. For subject i, i ∈ {0, 1, ..., n_R + n_M}, let R_i = 1 (M_i = 1) if subject i is in the reference group (mixture), and R_i = 0 (M_i = 0) otherwise. Unless explicitly stated otherwise, order the subjects so the first n_M belong to the mixture (i.e. M_i = 1 for 1 ≤ i ≤ n_M). Note that we use the shorthand that “subject i is in the mixture” or “subject i is in the mixture group” if subject i contributes DNA to the mixture. Let e_i be the ancestry, or ethnicity, of subject i. We start by assuming homogeneous groups and therefore e_i ∈ {e₀, e_M, e_R}, with e_i = e_R if R_i = 1, and e_i = e_M if M_i = 1.

Let there be N SNPs in the study. Let Q_{i j 1} and Q_{i j 2} indicate whether the minor allele (relative to ethnicity e₀) is on chromosome 1 and 2, respectively, for subject i at SNP j. Let Y_{i j} = 0.5 (Q_{i j 1} + Q_{i j 2}) be the genotype for subject i at SNP j; Y_{i j} ∈ {0, 0.5, 1}. We define the population allele frequencies by p_{M j} ≡ P(Q_{i j k} = 1 | M_i = 1, i > 0), p_{R j} ≡ P(Q_{i j k} = 1 | R_i = 1, i > 0), and p_{0 j} ≡ P(Q_{0 j k} = 1), where k ∈ {1, 2}. For calculating genotype probabilities, we assume Hardy-Weinberg Equilibrium.

Most genotyping platforms directly measure fluorescent intensity, which should be proportional to allele frequency. We label the intensity measures for allele A and allele B at SNP j from the mixture group as I_{AM j} and I_{BM j}. These intensities are measurements from a “pooled sample”. We label the ratio of intensities by γ_{M j} = I_{AM j} / (I_{AM j} + I_{BM j}). If intensity measurements are also available for individuals, then we can calculate a constant k_{M j} using known formulas (Pearson et al., 2007; Macgregor et al., 2008) and use a better proxy for the A allele frequency, γ_{M j} = I_{AM j} / (I_{AM j} + k_{M j} I_{BM j}). Similarly, define {I_{AR j}, I_{BR j}, γ_{R j}, k_{R j}} for the reference group. Therefore, the distance measure at each SNP from the Homer et al. paper can be defined as

D_{L_{1}} (j) = | Y_{0 j} - γ_{R j} | - | Y_{0 j} - γ_{M j} |

(1)

Their test statistic can then be defined as

T_{L_{1}} = \frac{\sqrt{N} (D_{L_{1}} - μ_{0})}{\sqrt{N^{- 1} \sum_{j} {(D_{L_{1}} (j) - D_{L_{1}})}^{2}}}

(2)

where $D_{L_{1}} \equiv \frac{1}{N} \sum_{j} D_{L_{1}} (j)$ and μ₀ = 0. In this manuscript, we will also discuss

D_{L_{2}} (j) = {(Y_{0 j} - γ_{R j})}^{2} - {(Y_{0 j} - γ_{M j})}^{2}

(3)

and the corresponding values D_{L_2} and T_{L_2}. The “L₁” and “L₂” subscripts emphasize the distance measure used for the statistic. We also discuss the case when intensities perfectly reflect underlying allele frequencies and define X (j) ≡ | Y_{0 j} - p̂_{R j} | - | Y_{0 j} - p̂_{M j} |, where ${\hat{p}}_{M j} \equiv n_{M}^{- 1} \sum_{i : M_{i} = 1} Y_{i j}$ and ${\hat{p}}_{R j} \equiv n_{R}^{- 1} \sum_{i : R_{i} = 1} Y_{i j}$.
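As a rough illustration of equations 1 to 3, the per-SNP distances and the statistic of equation 2 could be computed along the following lines. This is only a sketch: the function names (`intensity_ratio`, `d_l1`, `d_l2`, `t_statistic`) are hypothetical and chosen for this example, inputs are plain Python lists of equal length, and `k=1.0` corresponds to the uncorrected intensity ratio.

```python
import math


def intensity_ratio(i_a, i_b, k=1.0):
    """Proxy for the A-allele frequency: I_A / (I_A + k * I_B)."""
    return i_a / (i_a + k * i_b)


def d_l1(y0, gamma_r, gamma_m):
    """Per-SNP L1 distance difference, |Y0 - gamma_R| - |Y0 - gamma_M| (equation 1)."""
    return [abs(y - r) - abs(y - m) for y, r, m in zip(y0, gamma_r, gamma_m)]


def d_l2(y0, gamma_r, gamma_m):
    """Per-SNP L2 distance difference, (Y0 - gamma_R)^2 - (Y0 - gamma_M)^2 (equation 3)."""
    return [(y - r) ** 2 - (y - m) ** 2 for y, r, m in zip(y0, gamma_r, gamma_m)]


def t_statistic(d, mu0=0.0):
    """Equation 2: sqrt(N) * (mean(d) - mu0) / sd(d), treating SNPs as independent.

    Raises ZeroDivisionError if all per-SNP values are identical.
    """
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / n  # N^{-1} sum of squared deviations
    return math.sqrt(n) * (mean - mu0) / math.sqrt(var)
```

Passing the output of `d_l1` to `t_statistic` gives T_{L_1}; passing the output of `d_l2` gives T_{L_2}.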

In general, we assume that γ_{M j} and γ_{R j} can be described by the following statistical models:

\begin{array}{l} γ_{M j} = \frac{\sum_{i : M_{i} = 1} Y_{i j}}{n_{M}} + β_{M j} + ε_{M j} = {\hat{p}}_{M j} + β_{M j} + ε_{M j} \\ γ_{R j} = \frac{\sum_{i : R_{i} = 1} Y_{i j}}{n_{R}} + β_{R j} + ε_{R j} = {\hat{p}}_{R j} + β_{R j} + ε_{R j} \end{array}

(4)

where $ε_{M j} \sim N (0, σ_{M j}^{2})$, $ε_{R j} \sim N (0, σ_{R j}^{2})$, and {β_{M j}, β_{R j}} are SNP/platform specific biases. Therefore, the error is now linearly associated with the ratio of intensities, instead of the intensities themselves. As studies tend to exclude SNPs with rare minor alleles, we assume that 0 ≤ γ_{M j}, γ_{R j} ≤ 1 will generally hold. This description of γ_{M j} and γ_{R j} as normal variables is a simplification, but we use it only to show the existence of and then approximate expected biases. Note that the properties of our suggested test statistic do not depend on the validity of equation 4. We define ${\vec{β}}_{M} = \{β_{M 1}, \dots, β_{M N}\}$, ${\vec{β}}_{R} = \{β_{R 1}, \dots, β_{R N}\}$, ${\vec{σ}}_{M}^{2} = \{σ_{M 1}^{2}, \dots, σ_{M N}^{2}\}$, and ${\vec{σ}}_{R}^{2} = \{σ_{R 1}^{2}, \dots, σ_{R N}^{2}\}$. Finally, we use the abbreviation Θ to represent the set of all parameters, $Θ = \{n_{M}, n_{R}, {\vec{β}}_{M}, {\vec{β}}_{R}, {\vec{σ}}_{R}^{2}, {\vec{σ}}_{M}^{2}, e_{0}, e_{M}, e_{R}\}$.

2.2. D_{L_1}: The Original Test

The definition of any test requires a statement of the null hypothesis and the list of outcomes that would lead to the rejection of that null. Here, the desired null hypothesis is that the individual of interest is not in the mixture, M₀ = 0, and the original test rejects this null when T_{L_1}, with μ₀ = 0, is large. The next goal is to calculate the appropriate threshold, type I error rate, and power. Unfortunately, T_{L_1} is not a pivotal statistic, in that its distribution still depends on the parameter set Θ. Stating M₀ = 0 does not lead to a single distribution of T_{L_1}. Therefore, for the original paper to discuss the type I error rate and power, three additional assumptions were needed:

I. Identical Ethnicities: e₀ = e_M = e_R
II. Identical Sample Sizes: n_M = n_R
III. Identical Platforms: β_{M j} = β_{R j} and $σ_{M j}^{2} = σ_{R j}^{2} \forall j$

With these added assumptions, T_{L_1} no longer depends on Θ, and they could choose a threshold t_α so that the type I error of the test would be

P (T_{L_{1}} > t_{α} | M_{0} = 0, e_{0} = e_{M} = e_{R}, n_{M} = n_{R}, {\vec{β}}_{M} = {\vec{β}}_{R}, {\vec{σ}}_{M}^{2} = {\vec{σ}}_{R}^{2}) = α

(5)

The power was then calculated as

P (T_{L_{1}} > t_{α} | M_{0} = 1, e_{0} = e_{M} = e_{R}, n_{M} = n_{R}, {\vec{β}}_{M} = {\vec{β}}_{R}, {\vec{σ}}_{M}^{2} = {\vec{σ}}_{R}^{2}) = β_{P O W}

(6)

Unfortunately, if one or more of these three assumptions is false, the type I error rate and power can differ from α and β_{POW}. Here, we consider a test to be biased for a given Θ if P(T_{L_1} > t_α | M₀ = 0, Θ) ≠ α or P(T_{L_1} > t_α | M₀ = 1, Θ) ≠ β_{POW}. In the following sections, we show the magnitude of the bias when each of the assumptions is violated.

2.2.1. D_{L_1}: Bias from Ancestry

In this section, we show that unless the ancestries of the reference and mixture groups are extremely well matched, the false positive rate can easily be near 1, as opposed to the predicted α. Ignoring the sample size and platform parameters, there are seven possible scenarios (Table 1). The original paper thoroughly discussed scenarios H₀ and H₁ with assumptions II and III holding, but only warned of potential bias in the other cases (i.e. P(T_{L_1} > t_α | M₀ = 0, e_M, e_R, e₀) ≠ α and P(T_{L_1} > t_α | M₀ = 1, e_M, e_R, e₀) ≠ β_{POW} for all combinations of e₀, e_M, and e_R). Here, we want to estimate the magnitude of the type I error for the scenario H_{f p}, where the result is likely to be a false positive (Table 1). Under H_{f p}, the individual is not in the mixture, but only the ancestries of the individual of interest and the mixture are the same (e₀ = e_M ≠ e_R).

Table 1:

There are seven scenarios depending on whether the individual of interest is in the mixture (column 2) and depending on whether the ethnicity of the individual of interest (e₀), the ethnicity of the mixture group (e_M), and the ethnicity of the reference group (e_R) are the same (columns 3,4,5). ‘Yes’ indicates the statement at the top of the column is true and a blank indicates the statement is false. For example, in scenario 1, which we label as “H₀”, the individual of interest is not in the mixture (the statement M₀ = 1 is false), the ethnicities of the individual and mixture are the same (e₀ = e_M), the ethnicities of the individual and reference are the same (e₀ = e_R), and the ethnicities of the mixture and reference are the same (e_R = e_M). In this case, all individuals share a common ethnicity. Scenarios 1 and 2 were studied by Homer et al. (2008). Here, we also study scenario 3, which we label as H_{f p} because it is likely to lead to a false positive result.

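| Scenario | M₀ = 1 | e₀ = e_M | e₀ = e_R | e_R = e_M |
|---|---|---|---|---|
| 1 (H₀) | | Yes | Yes | Yes |
| 2 (H₁) | Yes | Yes | Yes | Yes |
| 3 (H_{f p}) | | Yes | | |
| 4 | | | Yes | |
| 5 | | | | Yes |
| 6 | | | | |
| 7 | Yes | Yes | | |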


Estimating the true type I error rate requires understanding the distribution of T_{L_1}, or equivalently, D_{L_1} (j), for the three scenarios, H₀, H₁, and H_{f p}. As the allele frequencies at many of the SNPs will likely be the same for all populations, p_{0 j} = p_{R j} = p_{M j}, the type I error rate will depend on s, defined to be the proportion of all SNPs that are ancestry independent, and the differences between p_{R j} and p_{M j} for the other (1 - s)N SNPs. To start, we calculate the asymptotic distribution of D_{L_1} (j) (see appendix 5.1), the fundamental component of T_{L_1}, in all three scenarios. When n is large, n ≡ n_M = n_R, and β_{R j} = β_{M j} ∀ j,

\begin{array}{r} H_{0} : D_{L_{1}} (j) \sim N (μ_{0} (j), σ^{2} (j)) \\ H_{1} : D_{L_{1}} (j) \sim N (μ_{1} (j), σ^{2} (j)) \\ H_{f p} : D_{L_{1}} (j) \sim N (μ_{f p} (j), σ_{f p}^{2} (j)) \end{array}

(7)

where

\begin{array}{l} μ_{0} (j) = 0 \\ μ_{1} (j) = \frac{1}{n} 2 p_{M j} {(1 - p_{M j})}^{2} \\ μ_{f p} (j) = \begin{cases} (p_{M j} - p_{R j}) (1 - 2 {(1 - p_{M j})}^{2}) & \text{if } p_{R j} < 0.5 \\ - p_{M j} (2 p_{M j}^{2} - 6 p_{M j} + 3) + p_{R j} (1 - 2 p_{M j}^{2}) & \text{if } p_{R j} > 0.5 \end{cases} \\ σ^{2} (j) = \frac{p_{M j} (1 - p_{M j})}{n} + σ_{M j}^{2} + σ_{R j}^{2} \\ σ_{f p}^{2} (j) = \frac{1}{2 n} p_{M j} (1 - p_{M j}) + \frac{1}{2 n} p_{R j} (1 - p_{R j}) + σ_{R j}^{2} + σ_{M j}^{2} + {(p_{M j} - p_{R j})}^{2} 4 p_{M j} (2 - p_{M j}) {(1 - p_{M j})}^{2} \end{array}

(8)

Figure 1 shows the type I error rate as a function of s, assuming that p_{M j} is uniformly distributed over the interval (0, 0.5), p_{R j} = p_{M j} at SNPs 1, ..., sN, and p_{R j} is uniformly distributed over the interval (0, 0.5) for the remaining (1 - s)N SNPs, independent of p_{M j}. This will overestimate the type I error rate for a given s, because, in truth, p_{R j} will often be relatively close to p_{M j}. Nevertheless, this simplified example demonstrates the magnitude of the type I error rate. Figure 1 shows that when n_M is large (i.e. n_M = 1000), over 99% of all SNPs would need to be ancestry independent to prevent the false positive rate from being near 1. Therefore, the false positive rate can easily equal or exceed the power of the test. Next, we can calculate the value of s needed for the false positive rate to equal the power, which is essentially when μ̄₁ = μ̄_{f p}, where μ̄₁ ≡ E[μ₁ (j)] and μ̄_{f p} ≡ (1 - s) E[μ_{f p} (j)]. In appendix 5.2, assuming the allele frequencies have the above uniform distributions, we show that μ̄₁ = 0.23/n and μ̄_{f p} = 0.062(1 - s). Therefore, if 1 - s = 3.7/n, μ̄₁ = μ̄_{f p}. If our mixture contained 1000 subjects, the reference and mixture groups would only need to differ at 0.37% of all SNPs for the false positive rate to exceed the power. Had we allowed the allele frequencies for those (1 - s)N SNPs to be randomly distributed over the interval [0, 1], even fewer SNPs would need to differ. Here, μ̄₁ = μ̄_{f p} if 1 - s = 1.2/n, and only 0.12% of all SNPs would need to differ.

2.2.2. D_{L_1}: Bias from Sample Size
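The two averages quoted above can be checked numerically from equation 8. The sketch below assumes only the uniform(0, 0.5) frequency model described in the text and uses a simple midpoint rule; the helper names are hypothetical and are not taken from the appendix.

```python
def midpoint_mean(f, lo, hi, steps=200):
    """Mean of f(p) for p uniform on (lo, hi), approximated by the midpoint rule."""
    h = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * h) for k in range(steps)) / steps


def mu1_times_n(p_m):
    """n * mu_1(j) from equation 8."""
    return 2.0 * p_m * (1.0 - p_m) ** 2


def mu_fp(p_m, p_r):
    """mu_fp(j) from equation 8, branch p_R < 0.5."""
    return (p_m - p_r) * (1.0 - 2.0 * (1.0 - p_m) ** 2)


# Average of n * mu_1(j) over p_M ~ uniform(0, 0.5); roughly 0.229.
mu1_bar_times_n = midpoint_mean(mu1_times_n, 0.0, 0.5)

# Average of mu_fp(j) over independent p_M, p_R ~ uniform(0, 0.5); roughly 0.0625.
mu_fp_per_snp = midpoint_mean(
    lambda p_m: midpoint_mean(lambda p_r: mu_fp(p_m, p_r), 0.0, 0.5),
    0.0,
    0.5,
)

# Break-even value of n * (1 - s), where the two averages are equal.
break_even = mu1_bar_times_n / mu_fp_per_snp
```

Under these assumptions the ratio is about 3.7, in line with the condition 1 - s = 3.7/n stated above.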

In this section, we show that unless the number of subjects in the mixture and reference groups are equal, either type I or type II error can be higher than expected. Here, we keep assumptions I and III, identical ancestries and platforms, and examine the test’s bias if n_M and n_R are finite and unequal. Before discussing the origin and extent of this bias, we state a useful fact, which is proven in appendix 5.3.

Inequality of Absolute Values:

Let p < 0.5, $p_{1} \sim N (p, σ_{1}^{2})$ , $p_{2} \sim N (p, σ_{2}^{2})$ , and p₁ ⊥ p₂.

If $σ_{1}^{2} > σ_{2}^{2}$ , then E [|0.5 – p₁|] > E [|0.5– p₂|]

Our goal is to show that if n_M ≠ n_R, then E[D_{L_1} (j)] ≠ 0 when M₀ = 0, even under otherwise ideal conditions. This is demonstrated empirically by figure 2. A more complete discussion follows. Without loss of generality, assume that we have a large reference sample and n_R > n_M. Then var(p̂_{M j}) > var(p̂_{R j}). Because p̂_{R j} is a more precise estimate of p, the expected L₁ distance between Y_{0 j} and p̂_{R j} tends to be smaller than the expected distance between Y_{0 j} and p̂_{M j}, or, equivalently, E[D_{L_1} (j)] < 0. This would be a trivial statement if distance were measured in L₂. For L₁, we will show that E[X (j)] < 0, which is essentially equivalent to E[D_{L_1} (j)] < 0. Recall that X (j) ≡ | Y_{0 j} - p̂_{R j} | - | Y_{0 j} - p̂_{M j} |. To use the Inequality of Absolute Values, we start by assuming ${\hat{p}}_{R j} \sim N (p, n_{R}^{- 1} σ_{p}^{2})$ and ${\hat{p}}_{M j} \sim N (p, n_{M}^{- 1} σ_{p}^{2})$, and decompose X (j) as

X (j) = X_{j 1} 1 (Y_{0 j} = 1) + X_{j 0.5} 1 (Y_{0 j} = 0.5) + X_{j 0} 1 (Y_{0 j} = 0)

(9)

where

\begin{array}{c} X_{j 1} = | 1 - {\hat{p}}_{R j} | - | 1 - {\hat{p}}_{M j} | \\ X_{j 0.5} = | 0.5 - {\hat{p}}_{R j} | - | 0.5 - {\hat{p}}_{M j} | \\ X_{j 0} = | 0 - {\hat{p}}_{R j} | - | 0 - {\hat{p}}_{M j} | \end{array}

(10)

It is immediately clear that when the individual of interest is not in the mixture, E[X_{j 1}] = E[p̂_{M j}] - E[p̂_{R j}] = 0 and E[X_{j 0}] = E[p̂_{R j}] - E[p̂_{M j}] = 0. With our assumptions of normality and the above inequality, E[X_{j 0.5}] < 0. Consequently, E[X (j)] < 0. For an even more intuitive understanding, let n_R ≈ ∞ and p̂_{R j} = p_{R j} ≡ 0.5. Then, clearly, |0.5 - p̂_{R j}| = 0, and X_{j 0.5} < 0.

Figure 2: E[D_{L_1} (j)] ≠ 0 even when p_{R j} = p_{M j} ≡ p. In this example, where we let n_R = ∞ (p̂_{R j} = p), we find that E[D_{L_1} (j)] decreases as the number of individuals in the mixture shrinks toward 0. We plot E[D_{L_1} (j)] (y-axis) vs n_M (x-axis) for multiple values of p.

2.2.3. D_{L_1}: Bias from Platform
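The sign of E[X (j)] can also be checked without the normal approximation. The following sketch is an illustration rather than the calculation used in the appendix: it assumes that the allele count in a group of n diploid subjects is binomial(2n, p) under Hardy-Weinberg Equilibrium, and it computes the expectation exactly. All function names are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Binomial(n, p) probability of k successes, computed in log space (0 < p < 1)."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def expected_abs_dev(y0, n_subjects, p):
    """E|y0 - p_hat|, where p_hat is the allele frequency among n_subjects subjects."""
    alleles = 2 * n_subjects
    return sum(
        binom_pmf(k, alleles, p) * abs(y0 - k / alleles) for k in range(alleles + 1)
    )


def expected_x(n_r, n_m, p):
    """E[X(j)] when the individual of interest is in neither group."""
    # Genotype probabilities for the individual of interest under HWE.
    geno_probs = {0.0: (1 - p) ** 2, 0.5: 2 * p * (1 - p), 1.0: p ** 2}
    return sum(
        prob * (expected_abs_dev(y0, n_r, p) - expected_abs_dev(y0, n_m, p))
        for y0, prob in geno_probs.items()
    )
```

For example, `expected_x(200, 10, 0.4)` is negative, while `expected_x(50, 50, 0.4)` is zero, matching the argument that only the heterozygous term contributes to the bias.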

In this section, we show that unless the allele intensities for the mixture and reference group are measured on the same platform, the test may be biased, often leading to an increase in type II error. Normally, we need not concern ourselves with platform bias, as we can ensure that both samples will be measured on the same platform. Unfortunately, this is not the case here. For the mixture, we have a sample of DNA and we can choose our preferred platform. For the reference group, we often do not have access to actual samples of DNA, and instead base our estimates of allele frequencies on previously recorded genotypes for a group of individuals. Comparing γ_{M j} and $n_{R}^{- 1} \sum_{i : R_{i} = 1} Y_{i j}$ is equivalent to comparing allele frequencies measured on two different platforms. Also, even if the reference sample were an actual mixture of DNA, that sample might no longer be available for analysis.

First, if β_M ≠ β_R, then E[D_{L_1}] ≠ 0 when M₀ = 0, even under otherwise ideal conditions. Simple algebra shows that if model (4) were true,

E [D_{L_{1}} (j)] \approx (β_{M j} - β_{R j}) (1 - 2 {(1 - p_{M j})}^{2})

(11)

Here, we presume E[D_{L_1} (j)] > 0 implies E[D_{L_1}] > 0, or that $\sum_{j = 1}^{N} β_{M j} \neq \sum_{j = 1}^{N} β_{R j}$. Fortunately, this bias can be easily removed by substituting $γ_{M j}^{*} \equiv γ_{M j} - {\hat{β}}_{M j}$ and $γ_{R j}^{*} \equiv γ_{R j} - {\hat{β}}_{R j}$ into the equations for D_{L_1} (j), where {β̂_{M j}, β̂_{R j}} is an unbiased approximation of {β_{M j}, β_{R j}}, or by using the appropriate constants k_{M j} and k_{R j} for calculating γ_{M j} and γ_{R j}.

However, this does not eliminate the potential bias caused by platform. If $σ_{M j}^{2} \neq σ_{R j}^{2}$, then E[| Y_{0 j} - γ_{R j} | - | Y_{0 j} - γ_{M j} |] ≠ 0. The origin of this confounding is identical to that for the n_M ≠ n_R case. Here, we would assume that ${\hat{p}}_{M j} + ε_{M j} \sim N (p, σ_{1}^{2})$ and ${\hat{p}}_{R j} + ε_{R j} \sim N (p, σ_{2}^{2})$. Next, note that even if E[β̂_{M j}] = β_{M j}, E[β̂_{R j}] = β_{R j}, and $σ_{M j}^{2} = σ_{R j}^{2}$, there can still be bias if var(β̂_{M j}) ≠ var(β̂_{R j}), as $E [| Y_{0 j} - γ_{R j}^{*} | - | Y_{0 j} - γ_{M j}^{*} |] \neq 0$. The origin of the confounding is easily identified if we assume that ${\hat{p}}_{M j} + {\hat{β}}_{M j} \sim N (p, σ_{1}^{2})$ and ${\hat{p}}_{R j} + {\hat{β}}_{R j} \sim N (p, σ_{2}^{2})$. The main importance of these statements is that the reference sample cannot simply be the average of known genotypes. In this case $σ_{R j}^{2} = 0$ and β_{R j} = 0 (i.e. var(β̂_{R j}) = 0). This would create a considerable test bias.

2.3. D_{L_2}: Switching from L₁ to L₂

Using the L₂ distance (recall D_{L_2} (j) = (Y_{0 j} - γ_{R j})² - (Y_{0 j} - γ_{M j})²) greatly simplifies calculations. Under the null hypothesis H₀, if we keep assumptions II and III, the mean and variance of D_{L_2} (j) can be easily calculated as (see appendix 5.4):
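The "simple algebra" behind equation 11 can be illustrated with a small exact calculation. The sketch below makes simplifying assumptions that are not part of the original derivation: the groups are large enough that p̂_{M j} = p̂_{R j} = p, the noise terms are ignored, the genotype of the individual follows Hardy-Weinberg Equilibrium, and both γ values stay below 0.5. The function names are hypothetical.

```python
def expected_d_l1(p, beta_m, beta_r):
    """E[D_L1(j)] when gamma_M = p + beta_M and gamma_R = p + beta_R exactly."""
    gamma_m = p + beta_m
    gamma_r = p + beta_r
    # Genotype probabilities for the individual of interest under HWE.
    geno_probs = {0.0: (1 - p) ** 2, 0.5: 2 * p * (1 - p), 1.0: p ** 2}
    return sum(
        prob * (abs(y0 - gamma_r) - abs(y0 - gamma_m))
        for y0, prob in geno_probs.items()
    )


def equation_11(p, beta_m, beta_r):
    """Right-hand side of equation 11."""
    return (beta_m - beta_r) * (1 - 2 * (1 - p) ** 2)
```

With p = 0.3, β_M = 0.02, and β_R = -0.01, both functions return 0.0006 up to rounding: the homozygous-B genotype contributes β_R - β_M with probability (1 - p)², and the other two genotypes contribute β_M - β_R.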

\begin{array}{rl} μ_{0} & = 0 \\ σ^{2} (j) & = 4 p_{M j}^{2} σ_{M j}^{2} + 4 p_{M j} σ_{M j}^{2} + 4 σ_{M j}^{4} - 8 p_{M j}^{2} σ_{M j}^{2} + \\ & \frac{1}{n} (2 p_{M j}^{2} - 4 p_{M j}^{3} + 2 p_{M j}^{4} - 4 p_{M j}^{2} σ_{M j}^{2} + 4 p_{M j} σ_{M j}^{2}) + \\ & \frac{1}{n^{2}} (p_{M j}^{2} - 2 p_{M j}^{2} + p_{M j}^{2}) + \\ & \frac{1}{n^{3}} (\frac{p_{M j}}{4} - \frac{7}{4} p_{M j}^{2} + 3 p_{M j}^{3} - \frac{3}{2} p_{M j}^{4}) \end{array}

(12)

With the same assumptions, under the alternative hypothesis, H₁, the mean of D_{L_2} (j) is

μ_{1} (j) = \frac{1}{n} p_{M j} (1 - p_{M j})

(13)

and the variance can be approximated by σ² (j). In contrast to D_{L_1}, we have estimated the parameters for the normal approximation of D_{L_2} without assuming large n. Equivalently, the approximation of $\sqrt{N} D_{L_{2}} \sim N (0, N^{- 1} \sum_{j} σ^{2} (j))$ requires only D_{L_2} (j₁) ⊥ D_{L_2} (j₂) ∀ j₁ ≠ j₂ and N being large. In addition to the ease of L₂, we found the statistic T_{L_2} to outperform T_{L_1}. This can be verified by simulation (data not shown) or by comparing the two normal approximations for T_{L_1} and T_{L_2}. In the online supplementary material, we compare the power of the statistics T_{L_1} and T_{L_2} for specific values of n, N, $\vec{p}$, and σ², assuming independent SNPs and p_{M j} ∼ uniform[0, 0.5]. The plots suggest a mild improvement using T_{L_2}.
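For readers who want to see where equation 13 comes from, the following short derivation is one way to reach it. It is a sketch under the stated assumptions (n ≡ n_M = n_R, a common allele frequency p_{M j}, identical platforms, and noise terms independent of the genotypes), so that γ_{R j} and γ_{M j} share the same first two moments and Y_{0 j} is independent of γ_{R j}. Under Hardy-Weinberg Equilibrium, var(Y_{0 j}) = p_{M j} (1 - p_{M j}) / 2, and under H₁ the individual is one of the n contributors to the mixture.

```latex
\begin{aligned}
E [D_{L_{2}} (j) \mid H_{1}]
  &= E [γ_{R j}^{2}] - E [γ_{M j}^{2}] + 2 E [Y_{0 j} (γ_{M j} - γ_{R j})] \\
  &= 2 \, cov (Y_{0 j}, γ_{M j}) \\
  &= \frac{2 \, var (Y_{0 j})}{n} \\
  &= \frac{1}{n} p_{M j} (1 - p_{M j})
\end{aligned}
```

The same steps with cov(Y_{0 j}, γ_{M j}) = 0 give μ₀ = 0 under H₀.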

2.4. $D_{L_{2}}^{*}$ : A New Statistic

Our ideal goal would be to develop a pivotal statistic that, when M₀ = 0, is a) independent of the individual of interest’s ethnicity; b) independent of the number of individuals and the ethnicity of those individuals in the mixture; and c) independent of the platform chosen to analyze the mixture. Although we could find no such statistic, we can take advantage of being able to easily identify an individual’s ethnicity by genotyping a small set of SNPs (< 0.01N). The previous requirement of identifying the ethnic composition of the mixture is a far more difficult task. Therefore, we introduce a new statistic that, given e₀, will have an N(0,1) distribution when M₀ = 0 regardless of the remaining parameters. The key difference in deriving this statistic is that the individuals in the reference group will be selected to have the same ancestry as the individual of interest, as opposed to the same ancestral composition as the mixture. Note that the small set of SNPs used to identify the individuals’ ancestries can be removed from the later analyses without greatly diminishing power. A suggested list of SNPs will be made available by the authors in the near future. Moreover, we found that the suggested statistic will lead to a test with increased power. The details of the statistic follow.

To describe the new statistic, we change the notation slightly as we no longer have Y_{i j} values for subjects in the mixture. Therefore, in addition to having γ_{M j} and Y_{0 j}, 1 ≤ j ≤ N, choose n_R subjects for a reference group and let Y_{i j} be the genotype for subject i, 1 ≤ i ≤ n_R, at SNP j. Note that i is bounded by n_R instead of n_R + n_M. Select the reference subjects so their ancestry is similar to that of the individual of interest (i.e. e_i = e₀ if R_i = 1).

Step 1:

Create n_R + 1 reference samples, ${\vec{γ}}_{R 0}, {\vec{γ}}_{R 1}, \dots, {\vec{γ}}_{R n_{R}}$, where ${\vec{γ}}_{R k} = \{γ_{R k 1}, \dots, γ_{R k N}\}$, k ∈ {0, ..., n_R}, and

γ_{R k j} = \frac{\sum_{i = 0, i \neq k}^{n_{R}} Y_{i j}}{n_{R}}

(14)

Here, $σ_{R}^{2} = 0$ in model (4). We have immediately removed a source of variation.

Step 2:

Measure the distance between each individual and the appropriate reference group, (Y_{i j} - γ_{R i j})², and compare the resulting value to the distance between that individual and the mixture

D_{i L_{2}}^{*} (j) = {(Y_{i j} - γ_{R i j})}^{2} - {(Y_{i j} - γ_{M j})}^{2}

(15)

Because the $σ_{R}^{2}$ term is absent from $D_{i L_{2}}^{*} (j)$ , $var (D_{i L_{2}}^{*} (j)) \leq var (D_{L_{2}} (j))$ . In fact,

var (D_{i L_{2}}^{*} (j)) = var (D_{L_{2}} (j)) - 2 σ_{R j}^{2} [p_{j} (1 - p_{j}) + \frac{p_{j} (1 - p_{j})}{n} + σ_{R j}^{2}]

(16)

All variance and covariance calculations will assume n ≡ n_R = n_M and p_j ≡ p_{R j} = p₀ _j = p_{M j}.

Step 3:

We then average those differences over all reference subjects,

{\bar{D}}_{L_{2}}^{*} (j) = \frac{\sum_{i = 1}^{n_{R}} D_{i L_{2}}^{*} (j)}{n_{R}}

(17)

to obtain an expected difference between the distance to the reference sample and the distance to the mixture, under the null hypothesis. The covariance between two terms in the sum is (see appendix 5.4)

\begin{array}{l} cov (D_{1 L_{2}}^{*} (j), D_{2 L_{2}}^{*} (j)) = \\ 2 σ_{M j}^{4} + \frac{(2 p_{j} - 2 p_{j}^{2}) σ_{M j}^{2}}{n} + \frac{\frac{1}{4} p_{j} - \frac{3}{4} p_{j}^{2} + 2 p_{j}^{3} - \frac{1}{2} p_{j}^{4}}{n^{3}} + \frac{\frac{- 1}{8} p_{j} + \frac{11}{8} p_{j}^{2} - \frac{5}{2} p_{j}^{3} + \frac{11}{4} p_{j}^{4}}{n^{4}} \end{array}

(18)

Each term on the right side of equation 18 must be positive, so the covariance must be positive (i.e. $cov (D_{1 L_{2}}^{*} (j), D_{2 L_{2}}^{*} (j)) > 0$ ).

Step 4:

Compare $D_{0 L_{2}}^{*} (j)$ to this averaged value

D_{L_{2}}^{*} (j) = D_{0 L_{2}}^{*} (j) - {\bar{D}}_{L_{2}}^{*} (j)

(19)

By noting the exchangeability between any two terms $D_{i_{1} L_{2}}^{*} (j)$ and $D_{i_{2} L_{2}}^{*} (j)$ , we can calculate the variance of $D_{L_{2}}^{*} (j)$

var (D_{L_{2}}^{*} (j)) = [1 + \frac{1}{n}] [var (D_{0 L_{2}}^{*} (j)) - cov (D_{1 L_{2}}^{*} (j), D_{2 L_{2}}^{*} (j))]

(20)

The key result is that, except for those values of $σ_{M j}^{*}$ where the power is essentially 1 regardless of the chosen statistic, $var (D_{L_{2}}^{*} (j)) < var (D_{L_{2}} (j))$ . As the means for the L₂ statistics are the same, the smaller variance suggests that our proposed statistic $T_{L_{2}}^{*}$ , based on $D_{L_{2}}^{*}$ , will usually have higher power than the original T_L_₂. This improvement is demonstrated in our simulations. Moreover, the statistic was designed so $E [D_{L_{2}}^{*} (j) | H_{0}] = 0$ if the ancestries of the reference group and individual of interest are matched correctly, regardless of platform and sample size.

Step 5:

Average over all SNPs to get

D_{L_{2}}^{*} = \frac{\sum_{j} D_{L_{2}}^{*} (j)}{N}

(21)

Because of the large number of SNPs, the CLT suggests that $\sqrt{N} D_{L_{2}}^{*} \sim N (0, σ_{D}^{2})$ under the null hypothesis. When allowing for dependency,

var (\frac{\sum_{j} D_{L_{2}}^{*} (j)}{\sqrt{N}}) = N^{- 1} \sum_{j} var (D_{L_{2}}^{*} (j)) + N^{- 1} \sum \sum_{j_{1} \neq j_{2}} cov (D_{L_{2}}^{*} (j_{1}), D_{L_{2}}^{*} (j_{2}))

(22)

Therefore, $σ_{D}^{2}$ can be estimated by

{\hat{σ}}_{D}^{2} = \frac{\sum_{j} {(D_{L_{2}}^{*} (j) - D_{L_{2}}^{*})}^{2}}{N} + \frac{\sum \sum_{j_{1} \neq j_{2}} (D_{L_{2}}^{*} (j_{1}) - D_{L_{2}}^{*}) (D_{L_{2}}^{*} (j_{2}) - D_{L_{2}}^{*})}{N}

(23)

In practice, we restrict the double sum to j₁ ≠ j₂ and | j₁ - j₂ | ≤ 500. In the online supplementary material, we use the HapMap samples to show $N^{- 1} \sum \sum_{| j_{1} - j_{2} | > 500} (D_{L_{2}}^{*} (j_{1}) - D_{L_{2}}^{*}) (D_{L_{2}}^{*} (j_{2}) - D_{L_{2}}^{*}) \approx 0$. We now define our new statistic,

T_{L_{2}}^{*} = \frac{\sqrt{N} D_{L_{2}}^{*}}{\sqrt{{\hat{σ}}_{D}^{2}}}

(24)

and note that $T_{L_{2}}^{*} \sim N (0, 1)$ . For purposes of future comparisons, we also define $D_{L_{1}}^{*} (j)$ , $D_{L_{1}}^{*}$ , and $T_{L_{1}}^{*}$ by replacing the L₂ distance with the L₁ distance in $D_{L_{2}}^{*} (j)$ , $D_{L_{2}}^{*}$ , and $T_{L_{2}}^{*}$ .
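Steps 1 to 5 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the function names are hypothetical, genotypes are plain lists coded 0, 0.5, or 1, and the denominator uses only the first term of equation 23, which corresponds to treating SNPs as independent.

```python
import math
import random


def hwe_genotype(p, rng):
    """Draw one genotype (0, 0.5, or 1) under HWE with A-allele frequency p."""
    return 0.5 * ((rng.random() < p) + (rng.random() < p))


def d_star_per_snp(y0, ref, gamma_m):
    """Per-SNP D*_{L2}(j) following Steps 1 to 4.

    y0: genotypes of the individual of interest (length N).
    ref: list of n_R genotype lists, one per reference subject.
    gamma_m: mixture allele-frequency estimates (length N).
    """
    subjects = [y0] + ref  # index 0 is the individual of interest
    n_r = len(ref)
    result = []
    for j, g_m in enumerate(gamma_m):
        total = sum(s[j] for s in subjects)
        # Steps 1 and 2: leave-one-out reference frequency (equation 14),
        # then the difference of squared distances (equation 15).
        d = [
            (s[j] - (total - s[j]) / n_r) ** 2 - (s[j] - g_m) ** 2
            for s in subjects
        ]
        # Steps 3 and 4: subtract the average over the reference subjects.
        result.append(d[0] - sum(d[1:]) / n_r)
    return result


def t_star(y0, ref, gamma_m):
    """T*_{L2} (equation 24) with the independent-SNP variance estimate."""
    d = d_star_per_snp(y0, ref, gamma_m)
    n_snps = len(d)
    mean = sum(d) / n_snps  # Step 5 (equation 21)
    var = sum((x - mean) ** 2 for x in d) / n_snps
    return math.sqrt(n_snps) * mean / math.sqrt(var)
```

A fuller version would add the covariance terms of equation 23 for SNP pairs within the chosen window; the helper `hwe_genotype` is included only so that small simulated inputs can be generated for experimentation.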

Although $T_{L_{2}}^{*}$ appears to be one of the better performing statistics, it is not necessarily the most intuitive. We first tried a simpler alternative, $D_{L_{2}}^{s}$ , but as we discuss here, this statistic proved to have extremely low power. Let

D_{L_{2}}^{s} = \frac{\sum_{j = 1}^{N} D_{L_{2}}^{s} (j)}{N}

(25)

where

D_{L_{2}}^{s} (j) = {(Y_{0 j} - γ_{M j})}^{2} - \frac{\sum_{i = 1}^{n_{R}} {(Y_{i j} - γ_{M j})}^{2}}{n_{R}}

(26)

This alternative also has the desirable characteristics that neither platform nor sample size can invalidate the equalities E[(Y₀ _j – γ_{M j})²] = E[(Y_{i j} – γ_{M j})²] and $E [D_{L}^{s} (j_{2})] = 0$ , and that the ancestry of the reference group need only match that of the individual of interest. The problem with such a simple approach is that the $var (D_{L_{2}}^{s}) \approx n \times var (D_{L_{2}})$ . To understand the origin of this vast difference between variances, we turn to a simple example, where Y_{i j} ∈ {0, 1} and P(Y_{i j} = 1) = p_jM. Here, $D_{L_{2}} (j) \in {2 (γ_{M j} - γ_{R j}) + γ_{R j}^{2} - γ_{M j}^{2}, γ_{R j}^{2} - γ_{M j}^{2}}$ , and the var(D_L_₂ (j)), under H₀, asymptotically approaches 0. In contrast, $D_{L_{2}}^{s} (j) \in {{(1 - γ_{M j})}^{2} + k ({\vec{Y}}_{R}, γ_{M j}), γ_{M j}^{2} + k ({\vec{Y}}_{R}, γ_{M j})}$ , where $k ({\vec{Y}}_{R}, γ_{M j})$ is the appropriate function of the reference group. Without the cancelation of the Y₀ _j terms in the statistic, asymptotically, $var (D_{L}^{s} (j_{2})) \to p_{j M} (1 - p_{j M}) (1 + 4 p_{j M}^{2})$ . Therefore, we designed $D_{L_{2}}^{*}$ to not only avoid potential bias, but retain the advantage gained by the cancelation of Y₀ _js in D_L_₁. Here we also note that the variance of D_L_₂ (j) or any similar statistic will be minimized when E[γ_{M j}] = E[γ_{R j}].

This same simple example, where $Y_{i j} \in \{0, 1\}$ and $P (Y_{i j} = 1) = p_{j M}$ , illustrates why the original statistic, $D_{L_{1}} (j) = | Y_{0 j} - γ_{R j} | - | Y_{0 j} - γ_{M j} |$ , or any of our adaptations, needs to include the reference term, $| Y_{0 j} - γ_{R j} |$ . Under the null hypothesis, $var (| Y_{0 j} - γ_{R j} | - | Y_{0 j} - γ_{M j} |) = 2 p_{j M} (1 - p_{j M}) / n \to 0$ . In contrast, without the cancelation of the $Y_{0 j}$ terms, $var (| Y_{0 j} - γ_{M j} |) \to p_{j M} - 5 p_{j M}^{2} + 8 p_{j M}^{3} - 4 p_{j M}^{4}$ . Even if the ethnicities were well matched, the variance of the genome-wide statistic would be proportional to the sum of these variances over the entire genome, not just over the subset of informative SNPs. For choosing the best statistic, the variance will be the deciding value. The additional term $| Y_{0 j} - γ_{R j} |$ does not change the expected difference between the test statistic under the null and alternative hypotheses.
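The limiting variance quoted for the binary example can be checked by direct enumeration. The sketch below is only an illustration: it assumes that γ_{M j} has already converged to p_{j M}, so that $| Y_{0 j} - γ_{M j} |$ takes the value 1 – p with probability p and the value p with probability 1 – p. The function names are hypothetical and are not part of the original analysis.

```python
from fractions import Fraction

def var_abs_distance(p):
    """Exact var(|Y0 - p|) for Y0 ~ Bernoulli(p), assuming gamma_M equals p."""
    outcomes = {Fraction(1): p, Fraction(0): 1 - p}  # value of Y0 -> probability
    mean = sum(prob * abs(y - p) for y, prob in outcomes.items())
    second_moment = sum(prob * abs(y - p) ** 2 for y, prob in outcomes.items())
    return second_moment - mean ** 2

def polynomial_limit(p):
    """Limit quoted in the text: p - 5p^2 + 8p^3 - 4p^4."""
    return p - 5 * p**2 + 8 * p**3 - 4 * p**4

def var_with_reference_term(p, n):
    """Null variance quoted in the text when the reference term is kept."""
    return 2 * p * (1 - p) / n

for p in (Fraction(1, 10), Fraction(3, 10), Fraction(1, 2)):
    assert var_abs_distance(p) == polynomial_limit(p)

p, n = Fraction(3, 10), 100
print(float(var_abs_distance(p)), float(var_with_reference_term(p, n)))
```

With exact rational arithmetic the enumeration and the polynomial agree, and the final line shows how much smaller the per-SNP variance is once the reference term is included.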

2.5. Data and Simulations

The power and type I error for the test based on $T_{L_{1}}$ have already been thoroughly described when e₀ = e_R = e_M. Here, we use simulations to accomplish two goals. First, we show that tests based on $T_{L_{2}}^{*}$ are more powerful than tests based on $T_{L_{1}}$ . Second, we show that if the threshold t_α is chosen so that the type I error rate is α when e₀ = e_R = e_M (equation 5), but, in fact, assumption I is not true, then the true type I error rate can greatly exceed α.

The simulations generated distributions of $T_{L_{1}}$ , $T_{L_{2}}$ , $T_{L_{1}}^{*}$ , and $T_{L_{2}}^{*}$ for each of the three scenarios, H₀, H₁, and H_{f p}. Recall that the denominator for $T_{L_{1}}$ was $\sqrt{N^{- 1} \sum_{j} {(D_{L_{1}} (j) - D_{L_{1}})}^{2}}$ , which assumes independence across SNPs. Therefore, in order to provide a fair comparison, we let the genotypes be independently distributed and let the denominator for $T_{L_{2}}^{*}$ be $\sqrt{N^{- 1} \sum_{j} {(D_{L_{2}}^{*} (j) - D_{L_{2}}^{*})}^{2}}$ .

First, we simulated 10,000 datasets (and 10,000 values of each test statistic) under H₀, with e₀ = e_M = e_R, n_M = n_R, and β_M = β_R = 0. Each dataset was simulated as follows. For each of 5000 SNPs (when n_M = 100) or 50,000 SNPs (when n_M = 1000), we randomly generated $p_{0 j}$ , the frequency of allele ‘A’ for the individual of interest, from a uniform(0.1,0.5) distribution. Then, we set $p_{M j} = p_{R_{0} j} = p_{R_{M} j} = p_{0 j}$ , where $p_{M j}$ , $p_{R_{0} j}$ , and $p_{R_{M} j}$ are the ‘A’ allele frequencies for subjects in the mixture, reference group R₀ (selected to match the individual of interest), and reference group R_M (selected to match the mixture). Given the allele frequencies and assuming Hardy-Weinberg Equilibrium, we then generated the genotypes for the individual of interest ( ${\vec{Y}}_{0}$ ), the n_M individuals in the mixture ( ${\vec{Y}}_{M}$ ), and the n_R individuals in each reference group ( ${\vec{Y}}_{R_{0}}$ , ${\vec{Y}}_{R_{M}}$ ). Given knowledge of the individuals in the mixture and reference groups, we generated the microarray intensities at each SNP as $γ_{M j} = n_{M}^{- 1} \sum_{i : M_{i} = 1} Y_{i j} + ε_{M j}$ and $γ_{R j} = n_{R}^{- 1} \sum_{i : R_{0 i} = 1} Y_{i j}$ , where $ε_{M j} \sim N (0, σ_{M}^{2})$ . Simulations were performed for a range of $σ_{M}^{2}$ values. The result is a single dataset $\{{\vec{Y}}_{0}, {\vec{Y}}_{M}, {\vec{Y}}_{R_{0}}, {\vec{Y}}_{R_{M}}\}$ and a set of values $\{T_{L_{1}}, T_{L_{2}}, T_{L_{1}}^{*}, T_{L_{2}}^{*}\}$ . After simulating all 10,000 datasets under H₀, we denote the 99th percentiles of the four test statistics by ${\hat{t}}_{α L_{1}}$ , ${\hat{t}}_{α L_{2}}$ , ${\hat{t}}_{α L_{1}}^{*}$ , and ${\hat{t}}_{α L_{2}}^{*}$ .
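The null simulation described above can be sketched in a few lines. This is a minimal illustration rather than the code used for the reported results, and it rests on three assumptions: genotypes are coded 0, 0.5, or 1 as in Appendix 5.1; $D_{L_{1}} (j) = | Y_{0 j} - γ_{R j} | - | Y_{0 j} - γ_{M j} |$ , following the decomposition in that appendix; and the statistic is taken to be $\sqrt{N}$ times the mean of $D_{L_{1}} (j)$ divided by the denominator given above. All function names are hypothetical.

```python
import math
import random

def draw_genotype(p, rng):
    """Genotype under Hardy-Weinberg equilibrium, coded as 0, 0.5 or 1."""
    return ((rng.random() < p) + (rng.random() < p)) / 2

def simulate_null_dataset(n_snps, n_subjects, sigma_m, rng):
    """One H0 dataset: the individual is in neither the mixture nor the reference."""
    y0, gamma_m, gamma_r = [], [], []
    for _ in range(n_snps):
        p = rng.uniform(0.1, 0.5)  # shared allele frequency: p_0 = p_M = p_R
        y0.append(draw_genotype(p, rng))
        mix = sum(draw_genotype(p, rng) for _ in range(n_subjects)) / n_subjects
        ref = sum(draw_genotype(p, rng) for _ in range(n_subjects)) / n_subjects
        gamma_m.append(mix + rng.gauss(0.0, sigma_m))  # platform noise on the mixture only
        gamma_r.append(ref)                            # reference is an exact average
    return y0, gamma_m, gamma_r

def t_l1(y0, gamma_m, gamma_r):
    """Assumed form of T_L1: sqrt(N) * mean(D) / sd(D), with sd using the 1/N divisor."""
    d = [abs(y - r) - abs(y - m) for y, m, r in zip(y0, gamma_m, gamma_r)]
    n = len(d)
    mean = sum(d) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in d) / n)
    return math.sqrt(n) * mean / sd

rng = random.Random(2009)
y0, gamma_m, gamma_r = simulate_null_dataset(500, 100, 0.03, rng)
print(t_l1(y0, gamma_m, gamma_r))
```

Repeating the last three lines many times and taking the empirical 99th percentile of the resulting values would give a threshold analogous to ${\hat{t}}_{α L_{1}}$ ; the small SNP count here is chosen only to keep the example fast.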

Next, we simulated 1000 datasets under H₁. To create the datasets, we repeated the above steps, but let $γ_{M j} = n_{M}^{- 1} (Y_{0 j} + \sum_{i = 2}^{n_{M}} Y_{i j}) + ε_{M j}$ . Then the power for the original test was the proportion of these 1000 $T_{L_{1}}$ values exceeding ${\hat{t}}_{α L_{1}}$ . A similar estimation of power was performed for $T_{L_{2}}$ , $T_{L_{1}}^{*}$ , and $T_{L_{2}}^{*}$ . Finally, we simulated datasets under H_{f p}. The type I error rate will depend on how the allele frequencies differ between the two ancestries. In the first case, we assumed a proportion, s, of the SNPs to be ancestry independent (i.e. $p_{M j} = p_{R_{0} j} = p_{R_{M} j} = p_{0 j}$ ). For j > s × 5000 or j > s × 50,000, we still enforced $p_{M j} = p_{R_{0} j} = p_{0 j}$ , but generated $p_{R_{M} j}$ from a uniform(0.1,0.5) distribution. In the second case, we assumed that the allele frequencies differed at all SNPs and let $p_{R_{M} j} = {(1 + k_{j})}^{\pm 1} p_{0 j}$ , where {k₁, ..., k_N} was a vector of equally spaced points over the range 0 to 0.15, 0.20, and 0.25 when n_M = 100 and 0 to 0.03, 0.05, and 0.07 when n_M = 1000. The superscript ± 1 indicates that the exponent was randomly generated to be −1 or 1 with equal probability. Here, we can quantify the extent of the difference between the two ethnicities by the measure λ, the variance inflation factor typical in case/control studies. For n_M = 100, the three ranges correspond to a λ of 1.16, 1.36, and 1.65, and for n_M = 1000 the three ranges correspond to 1.09, 1.32, and 1.67. For each of these cases, we simulated 1000 datasets and defined the type I error rate to be the percentage of these 1000 $T_{L_{1}}$ ( $T_{L_{2}}$ ) values exceeding ${\hat{t}}_{α L_{1}}$ ( ${\hat{t}}_{α L_{2}}$ ).
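Two pieces of this procedure are easy to misread in prose: the construction of the mismatched reference frequencies in the second H_{f p} case, and the conversion of simulated statistics into power or type I error estimates. The sketch below illustrates both under stated assumptions; the helper names, the quantile convention, and the toy inputs are hypothetical and are not taken from the original code.

```python
import random

def perturbed_frequencies(p0, k_max, rng):
    """Second H_fp case: p_RM_j = (1 + k_j)^(+/-1) * p_0j, k_j equally spaced on [0, k_max]."""
    n = len(p0)
    result = []
    for j, p in enumerate(p0):
        k = k_max * j / (n - 1)          # equally spaced points from 0 to k_max
        exponent = rng.choice((-1, 1))   # sign drawn with equal probability
        result.append(p * (1 + k) ** exponent)
    return result

def empirical_quantile(values, q):
    """Simple order-statistic quantile, one of several common conventions."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]

def rejection_rate(statistics, threshold):
    """Proportion of simulated statistics exceeding the threshold.

    Applied to H1 draws this estimates power; applied to H_fp draws it
    estimates the type I error rate.
    """
    return sum(s > threshold for s in statistics) / len(statistics)

rng = random.Random(1)
p0 = [rng.uniform(0.1, 0.5) for _ in range(5000)]
p_rm = perturbed_frequencies(p0, 0.25, rng)
print(min(p_rm), max(p_rm))
```

Because $p_{0 j}$ never exceeds 0.5 and $k_{j}$ never exceeds 0.25, the perturbed frequencies stay strictly inside (0, 1) for the ranges used here.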

To better quantify the type I error rate possible in real experiments, we used the 90 CEU and 45 Japanese individuals from the HapMap samples. For each CEU individual i, we selected 9 unrelated CEU individuals to create a positive mixture, $γ_{M +} = 0.1 ({\vec{Y}}_{i} + \sum_{k = 1}^{9} {\vec{Y}}_{k}) + {\vec{ε}}_{i}$ , or 10 unrelated individuals to create a negative mixture, $γ_{M -} = 0.1 (\sum_{k = 1}^{10} {\vec{Y}}_{k}) + {\vec{ε}}_{i}$ , where ${\vec{ε}}_{i} \equiv \{ε_{i 1}, \dots, ε_{i N}\}$ , $ε_{i j} \sim N (0, {0.01}^{2})$ and $ε_{i j_{1}} \perp ε_{i j_{2}}$ if j₁ ≠ j₂. To achieve meaningful levels of power, we chose N = 1000. For each CEU individual, 11 reference groups were similarly created, where $γ_{R t}$ (t = 0, ..., 10) included t Japanese individuals and 10 – t CEU individuals. We calculated the distribution of $T_{L_{1}}$ under H₀, H₁, and H_{f p} from 36,000 simulations (90 subjects × 400 sets of SNPs, where 400,000 SNPs divided into sets of 1000 gives 400 sets). We chose the threshold ${\hat{t}}_{α L_{1}}$ so that α = 0.05 if H₀ were true and then calculated the probability of rejection under H₁ (i.e. power) and under H_{f p} (i.e. type I error).

3. Results

3.1. Comparison of Power

Our first goal is to show that tests based on $T_{L_{2}}^{*}$ are more powerful than tests based on $T_{L_{1}}$ . As the results for the two sets of simulations, n_M = n_R = 1000 and n_M = n_R = 100, were similar, we limit our discussion to the latter. Recall, we chose to empirically estimate the 99th percentile of the test statistics under H₀. This was a necessity because our reference mixture is an average of known genotypes, $σ_{M}^{2} \neq σ_{R}^{2}$ , and as discussed in section 2.2.3, $T_{L_{1}}$ cannot be expected to follow a N(0, 1) distribution. Under H₀, when σ_M = 0.03, 0.05, 0.07, and 0.09, the averages of $T_{L_{1}}$ over the 10,000 simulations were −0.76, −1.71, −2.77, and −3.91 respectively. Similarly, the averages of $T_{L_{2}}$ were −1.85, −4.13, −6.46, and −8.68 for those same values of σ_M. The empirical variances were close to 1. In contrast, $T_{L_{2}}^{*}$ followed a N(0, 1) distribution, as promised, and this empirical step was superfluous. The empirical means for $T_{L_{1}}^{*}$ and $T_{L_{2}}^{*}$ were near 0, –0.015 ≤ ${\bar{T}}_{L_{1}}^{*}$ ≤ –0.010 and –0.024 ≤ ${\bar{T}}_{L_{2}}^{*}$ ≤ –0.015, regardless of σ_M. Moreover, not only were the empirical variances near 1, the empirical 99th percentiles were approximately 2.326 (2.29 ≤ ${\hat{t}}_{α L_{1}}^{*}$ ≤ 2.32 and 2.31 ≤ ${\hat{t}}_{α L_{2}}^{*}$ ≤ 2.32).
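The benchmark value 2.326 is the 99th percentile of a N(0, 1) distribution, and the spread of the empirical thresholds can be compared with the Monte Carlo error expected from 10,000 draws. The sketch below uses the standard large-sample approximation for the standard error of a sample quantile, $\sqrt{q (1 - q) / n} / φ (z_{q})$ ; this approximation is an addition for illustration and is not part of the original analysis.

```python
import math
from statistics import NormalDist

def normal_quantile(q):
    """q-th quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(q)

def quantile_standard_error(q, n_sims):
    """Large-sample standard error of an empirical q-quantile of N(0, 1) draws."""
    z = normal_quantile(q)
    density = NormalDist().pdf(z)
    return math.sqrt(q * (1 - q) / n_sims) / density

print(round(normal_quantile(0.99), 3))            # 2.326
print(quantile_standard_error(0.99, 10_000))
```

Under this approximation the standard error is a few hundredths, so empirical thresholds between 2.29 and 2.32 are within the range expected from simulation noise alone.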

For each value of σ_M, we then calculated the power for the four tests, based on $T_{L_{1}}$ , $T_{L_{2}}$ , $T_{L_{1}}^{*}$ , and $T_{L_{2}}^{*}$ . The alternative hypothesis was H₁ with the other parameters unchanged. These values are plotted against signal noise, σ_M, in Figure 3. Clearly, power is larger when the statistics are based on L₂ and when the reference group is matched to the individual of interest. Note, had we included an error term in the reference mixture (and possibly attained $T_{L_{1}}, T_{L_{2}} \sim N (0, 1)$ ), the benefit of using the new statistic would have been larger.

Figure 3: Tests based on $T_{L_{2}}^{*}$ have the highest power. Power for tests based on the four statistics, $T_{L_{1}}$ , $T_{L_{2}}$ , $T_{L_{1}}^{*}$ , and $T_{L_{2}}^{*}$ , is plotted for multiple values of $σ_{M}^{2}$ , the noise. Other parameters in the simulation were n_M = n_R = 100, N = 5000, s = 1, and β_M = β_R = 0.

3.2. Examination of False Positive Rate

Our second goal is to show that when assumption I is violated, the type I error rate can be much larger than α. Again, we presume t_α was selected so equation 5 holds. Simulations showed a quick increase in the false positive rate, $P (T_{L_{1}} > t_{α} | H_{f p})$ , as the proportion of ancestry-independent SNPs decreased in those tests based on $T_{L_{1}}$ . Parameter values were chosen so that the ideal test (e₀ = e_M = e_R), for both n_M = 100 and n_M = 1000, would have had power = 0.87 to detect an individual’s presence. Clearly, unless s is close to 1, the false positive rate exceeds this estimated power. Because we have finite sample sizes and $σ_{M}^{2} \neq σ_{R}^{2}$ , we cannot expect the predicted power and false positive rate to be equal when 1 – s = 3.7/n (Appendix 5.2). However, the simulations can confirm that the two are approximately equal when μ₁ = μ_{f p} and that the value of 1 – s which results in equality is proportional to 1/n. Where the power and false positive rate cross in Figure 4 for n_M = 100, we find μ₁ = 2.40, μ_{f p} = 2.41, and 1 – s = 0.088. Similarly for n_M = 1000, we find μ₁ = 2.65, μ_{f p} = 2.64, and 1 – s = 0.0089. For simulations under the second set of $p_{R_{M} j}$ , we found the false positive rates to be 0.11, 0.30, and 0.61 when n_M = 100 and 0.035, 0.15, and 0.51 when n_M = 1000.

Figure 4: The type I error rate for tests based on $T_{L_{1}}$ can exceed power. Type I error rate increases as the proportion, s, of ancestry independent SNPs decreases. Other parameters in the simulation were N = 5000 (50,000), $σ_{M}^{2} = {0.035}^{2} ({0.01}^{2})$ , and n_M = n_R = 100 (1000).

The false positive rates for the HapMap samples were calculated when the mixture and reference groups each had 10 subjects. Both the individual of interest and the subjects in the mixture were from the CEU population. As the ratio of Japanese:CEU individuals increased in the reference group, the false positive rate increased from the α-level, 0.05, to near 1. When the ratio exceeded 8:2, the false positive rate exceeded the estimated power.

4. Discussion

As the overall popularity of SNP microarray technology increases and the cost of the technology decreases, there will likely be a shift from STRs to genome-wide sets of SNPs as the preferred method for DNA identification. Therefore, databases designed to store genetic identifiers for individuals will likely store genotypes for sets of SNPs in the future. Coupled with our earlier discussion, there will be three main advantages of using high-density SNPs to determine whether an individual contributes DNA to a mixture: accessibility, higher resolving power, and the ability to work with low-quality, degraded DNA.

Here, we have demonstrated that tests based on $T_{L_{1}}$ can suffer from inflated type I and type II errors if the mixture contains individuals with unknown ancestries. Therefore, we introduced a new test statistic, $T_{L_{2}}^{*}$ , and an accompanying test that only require matching the ancestry of the reference group to that of the individual of interest. Even if the individual of interest has a mixture of ancestries, there should still be some subjects, with a similar mixture of ancestries, that can be used for comparison. This test is also robust to platform and sample size. We showed that both switching from the L₁ to the L₂ measure and switching to the new type of statistic increased the power to detect the individual of interest. Therefore, $T_{L_{2}}^{*}$ is not only more robust than $T_{L_{1}}$ , it tends to have increased power.

Supplementary Material

suppmaterial.pdf

additional power comparison

Figure 5: The type I error rate for tests based on $T_{L_{1}}$ can exceed power in simulations based on HapMap samples. The individual of interest and individuals in the mixture group were from the CEU population. The individuals in the reference group were a mixture from the CEU and Japanese populations. The type I error rate increased with the percentage of individuals in the reference group that were from the Japanese population. Other parameters in the simulation were N = 1000, n_M = n_R = 10, and $σ_{M}^{2} = σ_{R}^{2} = {0.01}^{2}$ .

Acknowledgments

We would like to thank Dr. David Craig for providing us with data from his original article and the anonymous reviewers for their helpful comments. This work was supported by NIH GM59507.

5. Appendix

5.1. Distribution of $D_{L_{1}} (j)$

To estimate the distribution of $D_{L_{1}} (j)$ , we make the following assumptions, where g ∈ {R, M}:

1. Model (4) is true.
2. β_{R j} = β_{M j} = 0
3. n ≡ n_M = n_R

And restrict the SNPs examined to those satisfying
4. |p_{g j} – 0.5| > δ

for some small δ such that
5. P(|p̂_{g j} – p_{g j}| > δ/2) ≈ 0.
6. P(|ε_{g j}| > δ/2) ≈ 0.

Assumptions 5 and 6 promise that n is large enough and the magnitude of ε_{g j} is small enough so that the signs inside the absolute values are determined by $Y_{0 j} - p_{g j}$ . For much of the exposition, we will still leave the β_{R j} and β_{M j} terms in the equations, although we assume they are small enough to not affect the sign inside the absolute values, to demonstrate their effects on bias and variance had they not been assumed to equal 0. To start, divide $D_{L_{1}} (j)$ into three components

D_{L_{1}} (j) \approx D_{j 1} 1 (Y_{0 j} = 1) + D_{j 0.5} 1 (Y_{0 j} = 0.5) + D_{j 0} 1 (Y_{0 j} = 0)

(27)

where

\begin{array}{l}
D_{j 1} \equiv \sum_{i = n + 1}^{2 n} \frac{1 - Y_{i j}}{n} - \frac{1 - Y_{1 j}}{n} - \sum_{i = 2}^{n} \frac{1 - Y_{i j}}{n} - β_{R j} - ε_{R j} + β_{M j} + ε_{M j} \\
D_{j 0.5} \equiv \sum_{i = n + 1}^{2 n} \frac{0.5 - Y_{i j}}{n} - \frac{0.5 - Y_{1 j}}{n} - \sum_{i = 2}^{n} \frac{0.5 - Y_{i j}}{n} - β_{R j} - ε_{R j} + β_{M j} + ε_{M j} \quad \text{if } p_{R j} < 0.5 \\
D_{j 0.5} \equiv \sum_{i = n + 1}^{2 n} \frac{Y_{i j} - 0.5}{n} - \frac{0.5 - Y_{1 j}}{n} - \sum_{i = 2}^{n} \frac{0.5 - Y_{i j}}{n} + β_{R j} + ε_{R j} + β_{M j} + ε_{M j} \quad \text{if } p_{R j} > 0.5 \\
D_{j 0} \equiv \sum_{i = n + 1}^{2 n} \frac{Y_{i j}}{n} - \frac{Y_{1 j}}{n} - \sum_{i = 2}^{n} \frac{Y_{i j}}{n} + β_{R j} + ε_{R j} - β_{M j} - ε_{M j}
\end{array}

(28)

By definition p_{M j} < 0.5. We next calculate the expected values of each component under the two scenarios, H₁ and H_{f p}. We know that $E [Y_{i j}] = p_{g j}^{2} + p_{g j} (1 - p_{g j}) = p_{g j}$ , where g indicates group, because a genotype equals 1 with probability $p_{g j}^{2}$ and 0.5 with probability $2 p_{g j} (1 - p_{g j})$ . Therefore, assuming scenario H₁, where without loss of generality, we let the individual of interest be subject 1 of the mixture,

E [D_{j 1} | H_{1}] = \frac{1}{n} - \frac{p_{M j}}{n} - β_{R j} + β_{M j}

(29)

E [D_{j 0.5} | H_{1}] = \frac{1}{2 n} - \frac{p_{M j}}{n} - β_{R j} + β_{M j}

(30)

E [D_{j 0} | H_{1}] = \frac{p_{M j}}{n} + β_{R j} - β_{M j}

(31)

We immediately see that

n E [D_{L_{1}} (j) | H_{1}] = 2 p_{M j} {(1 - p_{M j})}^{2} + n (β_{M j} - β_{R j}) (1 - 2 {(1 - p_{M j})}^{2})

(32)

Let us assume that β_{M j} = β_{R j}, then we get

\sum_{j} E [D_{L_{1}} (j) | H_{1}] = \frac{\sum_{j} 2 p_{M j} {(1 - p_{M j})}^{2}}{n}

(33)
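Equation 32 follows from weighting the three conditional expectations in equations 29 to 31 by the Hardy-Weinberg probabilities of the individual's genotype, $p_{M j}^{2}$ , $2 p_{M j} (1 - p_{M j})$ , and ${(1 - p_{M j})}^{2}$ . The short check below confirms the algebra with exact rational arithmetic; the function names are hypothetical and the test values are arbitrary.

```python
from fractions import Fraction

def expected_d_l1_h1(p, n, beta_m=Fraction(0), beta_r=Fraction(0)):
    """E[D_L1(j) | H1] as the genotype-weighted sum of equations 29-31."""
    w_one, w_half, w_zero = p**2, 2 * p * (1 - p), (1 - p) ** 2
    e_one = Fraction(1, n) - p / n - beta_r + beta_m       # equation 29
    e_half = Fraction(1, 2 * n) - p / n - beta_r + beta_m  # equation 30
    e_zero = p / n + beta_r - beta_m                       # equation 31
    return w_one * e_one + w_half * e_half + w_zero * e_zero

def closed_form(p, n, beta_m=Fraction(0), beta_r=Fraction(0)):
    """Right-hand side of equation 32, divided by n."""
    return (2 * p * (1 - p) ** 2
            + n * (beta_m - beta_r) * (1 - 2 * (1 - p) ** 2)) / n

for p in (Fraction(1, 10), Fraction(1, 4), Fraction(9, 20)):
    for n in (10, 100, 1000):
        b_m, b_r = Fraction(1, 50), Fraction(-1, 100)
        assert expected_d_l1_h1(p, n, b_m, b_r) == closed_form(p, n, b_m, b_r)
```

Setting `beta_m == beta_r` removes the second term and leaves the per-SNP summand of equation 33.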

Next we calculate the expected value under H_{f p},

\begin{array}{c} E [D_{j 1} | H_{f p}] & = & p_{M j} - p_{R j} - β_{R j} + β_{M j} \\ E [D_{j 0.5} | H_{f p}] & = & p_{M j} - p_{R j} - β_{R j} + β_{M j} if p_{R j} < 0.5 \\ E [D_{j 0.5} | H_{f p}] & = & p_{M j} + p_{R j} - 1 + β_{R j} + β_{M j} if p_{R j} > 0.5 \\ E [D_{j 0} | H_{f p}] & = & p_{R j} - p_{M j} + β_{R j} - β_{M j} \end{array}

(34)

We immediately see that

\begin{array}{l}
E [D_{L_{1}} (j) | H_{f p}, p_{R j} < 0.5] = (p_{M j} - p_{R j} + β_{M j} - β_{R j}) (1 - 2 {(1 - p_{M j})}^{2}) \\
E [D_{L_{1}} (j) | H_{f p}, p_{R j} > 0.5] = - p_{M j} (2 p_{M j}^{2} - 6 p_{M j} + 3) + p_{R j} (1 - 2 p_{M j}^{2}) + β_{M j} (1 - 2 {(1 - p_{M j})}^{2}) + β_{R j} (1 - 2 p_{M j}^{2})
\end{array}

(35)

For simplification, let us assume that β_{M j} = β_{R j}, then we get

\begin{array}{l}
\sum_{j} E [D_{L_{1}} (j) | H_{f p}] = \sum_{j : p_{R j} < 0.5} (p_{M j} - p_{R j}) (1 - 2 {(1 - p_{M j})}^{2}) + \\
\sum_{j : p_{R j} > 0.5} \left[ - p_{M j} (2 p_{M j}^{2} - 6 p_{M j} + 3) + p_{R j} (1 - 2 p_{M j}^{2}) \right]
\end{array}

(36)

Now, we turn our attention to the variance of $D_{L_{1}} (j)$ and recall that

var (D_{L_{1}} (j)) = E [var (D_{L_{1}} (j) | Y_{0 j})] + var (E [D_{L_{1}} (j) | Y_{0 j}])

(37)

We calculate $var (D_{L_{1}} (j) | Y_{0 j})$ , assuming all individuals within a group are unrelated,

\begin{array}{l} E [var (D_{L_{1}} (j) | Y_{0 j})] = \frac{1}{2 n} p_{R j} (1 - p_{R j}) + \frac{var (Y_{1 j})}{n^{2}} + \\ \frac{n - 1}{2 n^{2}} p_{M j} (1 - p_{M j}) + σ_{R j}^{2} + σ_{M j}^{2} \end{array}

(38)

Under H₀, where $E [D_{L_{1}} (j) | Y_{0 j}] = 0$ and p_{R j} = p_{M j}, we see that

var (E [D_{L_{1}} (j) | Y_{0 j}, H_{0}]) = [{(β_{M j} - β_{R j})}^{2} 4 p_{M j} (2 - p_{M j}) {(1 - p_{M j})}^{2}]

(39)

For simplification, we again assume that β_{M j} = β_{R j} and therefore

var (D_{L_{1}} (j) | H_{0}) = \frac{1}{n} p_{M j} (1 - p_{M j}) + σ_{R j}^{2} + σ_{M j}^{2}

(40)

Under H₁, we find that

E [var (D_{L_{1}} (j) | Y_{j 0}, H_{1})] = \frac{2 n - 1}{2 n^{2}} p_{M j} (1 - p_{M j}) + σ_{R j}^{2} + σ_{M j}^{2}

(41)

and for large n, we see that $E [var (D_{L_{1}} (j) | Y_{0 j}, H_{1})] \approx E [var (D_{L_{1}} (j) | Y_{0 j}, H_{0})]$ . Similarly, because $E [D_{L_{1}} (j) | Y_{0 j}, H_{1}] = E [D_{L_{1}} (j) | Y_{0 j}, H_{0}] + O (\frac{1}{n})$ , then $var (E [D_{L_{1}} (j) | Y_{0 j}, H_{0}]) = var (E [D_{L_{1}} (j) | Y_{0 j}, H_{1}]) (1 + O (\frac{1}{n})) + O (\frac{1}{n^{2}})$ . Therefore, we can assume that $var (D_{L_{1}} (j) | H_{1}) \approx var (D_{L_{1}} (j) | H_{0})$ .

Next, we turn to scenario H_{f p}. Clearly,

E [var (D_{L_{1}} (j) | Y_{0 j}, H_{f p})] = \frac{1}{2 n} p_{M j} (1 - p_{M j}) + \frac{1}{2 n} p_{R j} (1 - p_{R j}) + σ_{R j}^{2} + σ_{M j}^{2}

(42)

To calculate $var (E [D_{L_{1}} (j) | Y_{0 j}, H_{f p}])$ , we use the notation $p_{Δ j} = p_{M j} - p_{R j}$ and $β_{Δ j} = β_{M j} - β_{R j}$ ; then under H_{f p}, if we let p_{R j} < 0.5,

\begin{array}{l} var (E [D_{L_{1}} (j) | Y_{0 j}, H_{f p}]) = \\ [{(p_{Δ j} - β_{Δ j})}^{2} 4 p_{M j} (2 - p_{M j}) {(1 - p_{M j})}^{2}] \end{array}

(43)

Now, if we assume that the minor allele in the mixture is the same as the minor allele in the reference group, or equivalently, that we have at least identified a reference population of moderately similar composition, then we can be satisfied that equation (43) is a satisfactory approximation of $var (E [D_{L_{1}} (j) | Y_{0 j}, H_{f p}])$ . Again, with the simplification that β_{M j} = β_{R j}, we find

\begin{array}{l} var (D_{L_{1}} (j) | H_{f p}) = \frac{1}{2 n} p_{M j} (1 - p_{M j}) + \frac{1}{2 n} p_{R j} (1 - p_{R j}) + σ_{R j}^{2} + σ_{M j}^{2} + \\ [{(p_{M j} - p_{R j})}^{2} 4 p_{M j} (2 - p_{M j}) {(1 - p_{M j})}^{2}] \end{array}

(44)

5.2. Estimating when μ₁ = μ̄_{f p}

Here, we provide a rough approximation of the percentage of SNPs that would need to be ancestry dependent for E[μ₁ (j)] = (1 – s) E[μ_{f p} (j)]. We base this approximation on the assumption that p_{M j} and p_{R j} are uniform random variables over the intervals [0,0.5] and [0,1] respectively. Start by calculating $E [D_{L_{1}} (j) | p_{R j} < 0.5]$ from equation 35, letting β_{M j} = β_{R j}, where the expectation is over p_{R j} and p_{M j}.

\begin{array}{l}
E [D_{L_{1}} (j) | p_{R j} < 0.5, H_{f p}] \\
= 4 \int_{0}^{0.5} \int_{0}^{0.5} (p_{M j} - p_{R j}) (4 p_{M j} - 1 - 2 p_{M j}^{2}) d p_{R j} d p_{M j} \\
= 4 \int_{0}^{0.5} \int_{0}^{0.5} (- 4 p_{R j} p_{M j} + 4 p_{M j}^{2} + p_{R j} - p_{M j} + 2 p_{R j} p_{M j}^{2} - 2 p_{M j}^{3}) d p_{R j} d p_{M j} \\
= 0.062
\end{array}

(45)

Had we assumed p_{R j} > 0.5, $E [D_{L_{1}} (j) | p_{R j} > 0.5, H_{f p}] = 0.32$ . Combining the two equalities, we find $E [D_{L_{1}} (j) | H_{f p}] = 0.19$ . Obviously, from an evolutionary perspective, p_R and p_M should often be similar in value, but we ignore that here for simplicity. Furthermore, assume that only some proportion, 1 – s, of those N SNPs differ among ethnicities. Then E[μ_{f p} (j)] = 0.19 and μ̄_{f p} ≡ (1 – s) E[μ_{f p} (j)] = 0.19(1 – s).

Next, from equation 33, we know $n E [D_{L_{1}} (j)] = 2 p_{M j} {(1 - p_{M j})}^{2}$ , and again assuming p_{M j} is a uniform random variable over the interval [0,0.5], we find $E [D_{L_{1}} (j) | H_{1}] = 0.22 / n$ . Therefore, across all SNPs, assuming no linkage, μ₁ ≡ E[μ₁ (j)] = 0.22/n.

Therefore, in order for μ₁ = μ̄_{f p}, we would need 0.22/n = 0.19 (1 – s), which shows that false positives can be expected so long as the percentage of ancestry dependent SNPs, 1 – s, exceeds 1.16/n. Remember that the difference between p_R and p_M is likely to be overstated by the above simplifications, and therefore this 1.16/n will be low.
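The constants in this section can be approximated numerically from equations 32 and 35. The sketch below is an illustrative check, not the authors' calculation: it averages the relevant expressions over uniform grids with a midpoint rule, assuming β_{M j} = β_{R j}, p_{M j} uniform on [0, 0.5], and p_{R j} uniform on [0, 0.5] or [0.5, 1]. All function names are hypothetical.

```python
def uniform_mean_1d(f, lo, hi, steps=400):
    """Midpoint-rule estimate of E[f(U)] for U uniform on [lo, hi]."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) / steps

def uniform_mean_2d(f, x_range, y_range, steps=200):
    """Estimate of E[f(U, V)] for independent uniforms on the two ranges."""
    return uniform_mean_1d(
        lambda x: uniform_mean_1d(lambda y: f(x, y), *y_range, steps=steps),
        *x_range, steps=steps)

def mean_fp_low(p_m, p_r):
    """Equation 35 with p_R < 0.5 and equal beta terms."""
    return (p_m - p_r) * (1 - 2 * (1 - p_m) ** 2)

def mean_fp_high(p_m, p_r):
    """Equation 35 with p_R > 0.5 and equal beta terms."""
    return -p_m * (2 * p_m**2 - 6 * p_m + 3) + p_r * (1 - 2 * p_m**2)

def mean_h1_times_n(p_m):
    """n * E[D_L1(j) | H1] from equation 32 with equal beta terms."""
    return 2 * p_m * (1 - p_m) ** 2

e_low = uniform_mean_2d(mean_fp_low, (0.0, 0.5), (0.0, 0.5))
e_high = uniform_mean_2d(mean_fp_high, (0.0, 0.5), (0.5, 1.0))
e_fp = 0.5 * (e_low + e_high)   # p_R is below or above 0.5 with equal probability
n_mu1 = uniform_mean_1d(mean_h1_times_n, 0.0, 0.5)
print(e_low, e_high, e_fp, n_mu1, n_mu1 / e_fp)
```

The numerical values are close to the rounded constants quoted in this section, and the final ratio is the multiple of 1/n at which the two expectations balance under these simplifying assumptions.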

5.3. Inequality of Absolute Values (proof)

Theorem: Let $X_{1} \sim N (p, σ_{1}^{2})$ , $X_{2} \sim N (p, σ_{2}^{2})$ , g(X) = | 0.5–X |, X₁ ⊥ X₂, $σ_{1}^{2} > σ_{2}^{2}$ , and p < 0.5. Then

E [g (X_{1})] > E [g (X_{2})]

(46)

Let Z₁ = F₁ (X₁) and Z₂ = F₂ (X₂), where F₁ and F₂ are their respective cumulative distribution functions. Then

\begin{array}{rcl}
E [g (X_{2})] - E [g (X_{1})] & = & \int_{- \infty}^{\infty} g (X_{2}) f_{2} (X_{2}) d X_{2} - \int_{- \infty}^{\infty} g (X_{1}) f_{1} (X_{1}) d X_{1} \\
& = & \int_{0}^{1} g (F_{2}^{- 1} (Z_{2})) d Z_{2} - \int_{0}^{1} g (F_{1}^{- 1} (Z_{1})) d Z_{1} \\
& = & \int_{0}^{0.5} [g (F_{2}^{- 1} (Z_{2})) + g (F_{2}^{- 1} (1 - Z_{2}))] d Z_{2} - \int_{0}^{0.5} [g (F_{1}^{- 1} (Z_{1})) + g (F_{1}^{- 1} (1 - Z_{1}))] d Z_{1}
\end{array}

(47)

There are two possibilities. Case A: Assume $F_{2}^{- 1} (1 - Z_{2}) < 0.5$ . Then, because g(X) is linear for X < 0.5 and the normal quantiles satisfy $F_{2}^{- 1} (Z_{2}) + F_{2}^{- 1} (1 - Z_{2}) = 2 p$ ,

\frac{g (F_{2}^{- 1} (Z_{2})) + g (F_{2}^{- 1} (1 - Z_{2}))}{2} = g (\frac{F_{2}^{- 1} (Z_{2}) + F_{2}^{- 1} (1 - Z_{2})}{2}) = g (p) = 0.5 - p

(48)

Clearly, by similar logic,

\frac{g (F_{1}^{- 1} (Z_{1})) + g (F_{1}^{- 1} (1 - Z_{1}))}{2} \geq 0.5 - p = \frac{g (F_{2}^{- 1} (Z_{2})) + g (F_{2}^{- 1} (1 - Z_{2}))}{2}

(49)

Case B: $F_{2}^{- 1} (1 - Z_{2}) \geq 0.5$ . Then, clearly

\begin{array}{c} g (F_{1}^{- 1} (1 - Z_{1})) > g (F_{2}^{- 1} (1 - Z_{2})) \\ g (F_{1}^{- 1} (Z_{1})) > g (F_{2}^{- 1} (Z_{2})) \end{array}

(50)

Therefore, for case B we also see

\frac{g (F_{1}^{- 1} (Z_{1})) + g (F_{1}^{- 1} (1 - Z_{1}))}{2} \geq \frac{g (F_{2}^{- 1} (Z_{2})) + g (F_{2}^{- 1} (1 - Z_{2}))}{2}

(51)

Taking the expectation over both scenarios gives the desired result, E[g (X₂)] – E[g (X₁)] < 0.
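The theorem can also be checked numerically. For a normal variable W with mean μ and variance σ², the expectation of |W| has the closed form $σ \sqrt{2 / π} e^{- μ^{2} / (2 σ^{2})} + μ \, erf (μ / (σ \sqrt{2}))$ (the folded normal mean). Applying this with W = 0.5 – X gives E[g(X)] directly. The sketch below is an illustration that uses this standard identity rather than the quantile argument above; the function name is hypothetical.

```python
import math

def expected_abs_distance(p, sigma):
    """E|0.5 - X| for X ~ N(p, sigma^2), via the folded normal mean."""
    mu = 0.5 - p
    return (sigma * math.sqrt(2 / math.pi) * math.exp(-mu**2 / (2 * sigma**2))
            + mu * math.erf(mu / (sigma * math.sqrt(2))))

p = 0.3
for sigma_small, sigma_large in ((0.05, 0.1), (0.1, 0.2), (0.2, 0.4)):
    # Larger noise gives a larger expected absolute distance, as in equation 46.
    assert expected_abs_distance(p, sigma_large) > expected_abs_distance(p, sigma_small)

print(expected_abs_distance(p, 0.1), expected_abs_distance(p, 0.2))
```

As σ shrinks toward 0 the expectation approaches 0.5 – p, which matches the value of g(p) in Case A.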

5.4. Distribution of $D_{L_{2}} (j)$ and $D_{L_{2}}^{*} (j)$

The distribution for $D_{L_{2}} (j)$ and $D_{L_{2}}^{*} (j)$ can be described by the variances and covariances of terms in the following vector:

V_{1} = [Y_{i_{1}}, ε_{M}, Y_{i_{1}} Y_{i_{2}}, Y_{i_{1}} ε_{M}, Y_{i_{1}}^{2}, ε_{M}^{2}, Y_{i_{1}} Y_{i_{3}}, Y_{i_{2}} ε_{M}]'

(52)

By simple calculation, var(V₁) can be described by:

(\begin{matrix} \frac{1}{2} p - \frac{1}{2} p^{2} & 0 & \frac{1}{2} p^{2} - \frac{1}{2} p^{3} - \frac{3}{4} p^{4} & 0 & \frac{1}{4} p + \frac{1}{4} p^{2} - \frac{1}{2} p^{3} & 0 & \frac{1}{2} p^{2} - \frac{1}{2} p^{3} & 0 \\ 0 & σ^{2} & 0 & p σ^{2} & 0 & 0 & 0 & p σ^{2} \\ \frac{1}{2} p^{2} - \frac{1}{2} p^{3} & 0 & \frac{1}{4} p^{2} + \frac{1}{2} p^{3} - \frac{3}{4} p^{4} & 0 & \frac{1}{4} p^{2} + \frac{1}{4} p^{3} - \frac{1}{2} p^{4} & 0 & \frac{1}{2} p^{3} - \frac{1}{2} p^{4} & 0 \\ 0 & p σ^{2} & 0 & (\frac{1}{2} p^{2} + \frac{1}{2} p) σ^{2} & 0 & 0 & 0 & p^{2} σ^{2} \\ \frac{1}{4} p - \frac{1}{4} p^{2} - \frac{1}{2} p^{3} & 0 & \frac{1}{4} p^{2} + \frac{1}{4} p^{3} - \frac{1}{2} p^{4} & 0 & \frac{1}{8} p + \frac{5}{8} p^{2} - \frac{1}{2} p^{3} - \frac{1}{4} p^{4} & 0 & \frac{1}{4} p^{2} + \frac{1}{4} p^{3} - \frac{1}{2} p^{4} & 0 \\ 0 & 0 & 0 & 0 & 0 & 2 σ^{4} & 0 & 0 \\ \frac{1}{2} p^{2} - \frac{1}{2} p^{3} & 0 & \frac{1}{2} p^{3} - \frac{1}{2} p^{4} & 0 & \frac{1}{4} p^{2} + \frac{1}{4} p^{3} - \frac{1}{2} p^{4} & 0 & \frac{1}{4} p^{2} + \frac{1}{2} p^{3} - \frac{3}{4} p^{4} & 0 \\ 0 & p^{2} σ^{2} & 0 & (\frac{1}{2} p^{2} + \frac{1}{2} p) σ^{2} & 0 & 0 & 0 & p σ^{2} \end{matrix})

In this section, as we focus on only a single locus, we have dropped the subscript ‘j’ from Y_{i j}, p_j and $σ_{j}^{2}$ . Next, by expanding the terms, and assuming n ≡ n_M = n_R, $σ_{R}^{2} = σ_{M}^{2}$ and β_R = β_M = 0, we get

\begin{array}{l} D_{L_{2}} (j) = \\ \frac{- \sum_{i : R_{i} = 1} 2 Y_{0} Y_{i}}{n} - 2 Y_{0} ε_{R} + \frac{\sum_{i : R_{i} = 1} \sum_{k : R_{k} = 1} Y_{i} Y_{k}}{n^{2}} + \frac{\sum_{i : R_{i} = 1} 2 Y_{i} ε_{R}}{n} + ε_{R}^{2} - \\ (\frac{- \sum_{i : M_{i} = 1} 2 Y_{0} Y_{i}}{n} - 2 Y_{0} ε_{M} + \frac{\sum_{i : M_{i} = 1} \sum_{k : M_{k} = 1} Y_{i} Y_{k}}{n^{2}} + \frac{\sum_{i : M_{i} = 1} 2 Y_{i} ε_{M}}{n} + ε_{M}^{2}) \end{array}

(53)

and

\begin{array}{l}
var (D_{L_{2}} (j)) = (\frac{8}{n} + \frac{4 n (n - 1)}{n^{4}}) var (Y_{1} Y_{2}) + (8 + \frac{8}{n}) var (Y_{1} ε_{M}) + \\
\frac{2}{n^{3}} var (Y_{1}^{2}) + 2 var (ε_{M}^{2}) + \\
(- \frac{16}{n} - \frac{8}{n^{2}} + \frac{16}{n^{2}}) cov (Y_{1} Y_{2}, Y_{1} Y_{3}) + \\
(- \frac{8}{n^{3}}) cov (Y_{1} Y_{2}, Y_{1}^{2}) + (- 8 - \frac{- 8}{n}) cov (Y_{1} ε_{M}, Y_{2} ε_{M})
\end{array}

(54)

By substituting the appropriate values from the variance matrix for V₁, we get the variance given in section 2.3. Using the same concept of expanding the terms and straightforward calculation, we were able to find $cov (D_{1 L_{2}}^{*}, D_{2 L_{2}}^{*})$ .

References

Balding David J. Likelihood-based inference for genetic correlation coefficients. Theoretical Population Biology. 2003;63(3):221–230. doi: 10.1016/S0040-5809(03)00007-8.
Couzin Jennifer. Genetic privacy: Whole genome data not anonymous, challenging assumptions. Science. 2008 Sep;321(5894):1268–1374. doi: 10.1126/science.321.5894.1278.
Foreman LA, Champod C, Evett JA, Lambert S Pope. Interpreting dna evidence: A review. International Statistical Review. 2003;71:473–495.
Fung Wing K, Hu Yue-Qing. Evaluating mixed stains with contributors of different ethnic groups under the nrc-ii recommendation 4.1. Statistics in Medicine. 2002 Nov;21:3583–3593. doi: 10.1002/sim.1313.
Homer Nils, Szelinger Szabolcs, Redman Margot, Duggan David, Tembe Waibhav, Muehling Jill, Pearson John V, Stephan Dietrich A, Nelson Stanley F, Craig David W. Resolving individuals contributing trace amounts of dna to highly complex mixtures using high-density snp genotyping microarrays. PLoS Genet. 2008 Aug;4(8):e1000167. doi: 10.1371/journal.pgen.1000167.
Jobling Mark A, Gill Peter. Encoded evidence: Dna in forensic analysis. Nat Rev Genet. 2004 Oct;5:739–751. doi: 10.1038/nrg1455.
Kidd Kenneth K, Pakstis Andrew J, Speed William C, Grigorenko Elena L, Kajuna Sylvester LB, Karoma Nganyirwa J, Kungulilo Selemani, Kim Jong-Jin, Lu Ru-Band, Odunsi Adekunle, Okonofua Friday, Parnas Josef, Schulz Leslie O, Zhukova Olga V, Kidd Judith R. Developing a snp panel for forensic identification of individuals. Forensic Science International. 2006;164(1):20–32. doi: 10.1016/j.forsciint.2005.11.017.
Macgregor Stuart, Zhao Zhen Zhen, Henders Anjali, Nicholas Martin G, Montgomery Grant W, Visscher Peter M. Highly cost-efficient genome-wide association studies using dna pools and dense snp arrays. Nucleic Acids Res. 2008;36:e35. doi: 10.1093/nar/gkm1060.
Pearson JV, Huentelman MJ, Halperin RF, Tembe WD, Melquist S, et al. Identification of the genetic basis for complex disorders by use of poolingbased genomewide single-nucleotide-polymorphism association studies. American Journal of Human Genetics. 2007;80:126–139. doi: 10.1086/510686.
Stoneking M, Hedgecock D, Higuchi RG, Vigilant L, Erlich HA. Population variation of human mtdna control region sequences detected by enzymatic amplification and sequence-specific oligonucleotide probes. American Journal of Human Genetics. 1991;48(2):370–382.
The International HapMap Consortium. The international hapmap project. Nature. 2003 Dec;426:789–796. doi: 10.1038/nature02168.


